Discussion about this post

ryan boyd:

re DuckDB: it has gotten a lot better at larger-than-memory queries. However, in this case, you’re trying to create the database in memory and then run a query on it.

suggest two options:

a) run your aggregation query directly without the CTAS by specifying the table function read_parquet instead of “data”

OR

b) specify a database file so that the database is actually created on disk before trying to execute the aggregation query.

i.e. con = duckdb.connect("my.db") and then con.sql(...)

Ángel:

Why are you comparing a database with two data engines? Is that a fair comparison?

Did you only run the tests once? For more accurate results, shouldn’t you conduct at least three runs per test to calculate an average? Variability is common in these types of performance tests, so a single run is not reliable AT ALL.

Do you know how long it takes just to initialize each engine (Polars and Daft)?

Do you know which libraries Polars and Daft use to connect to S3? Are they the same or different?

How do these tools handle parallelization? Are there differences in how they distribute calculations across resources?

This article lacks depth and raises more questions than it answers.
