9 Comments

re DuckDB: it has gotten a lot better at larger-than-memory queries. However, in this case, you’re trying to create the database in memory and then run a query on it.

I suggest two options:

a) run your aggregation query directly, without the CTAS (CREATE TABLE AS SELECT), by specifying the table function read_parquet instead of "data"

OR

b) specify a database file so that the database is actually created on disk before trying to execute the aggregation query.

i.e., con = duckdb.connect("my.db") and then con.sql(...)
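For concreteness, here is a minimal sketch of both options. The bucket path, column names, and aggregation are placeholders rather than the article's actual query, and it assumes the httpfs extension and AWS credentials are already set up as in the article:

```python
import duckdb

parquet_glob = "s3://my-bucket/data/*.parquet"  # placeholder path

# Option (a): aggregate straight off the Parquet files with read_parquet,
# skipping the CTAS so nothing is materialized in memory first.
con = duckdb.connect()
con.sql(f"""
    SELECT some_key, COUNT(*) AS n, AVG(some_value) AS avg_value
    FROM read_parquet('{parquet_glob}')
    GROUP BY some_key
""").show()

# Option (b): back the database with a file so the CTAS is written to disk
# (and can spill for larger-than-memory data), then aggregate over it.
con = duckdb.connect("my.db")
con.sql(f"CREATE TABLE data AS SELECT * FROM read_parquet('{parquet_glob}')")
con.sql("SELECT some_key, COUNT(*) AS n FROM data GROUP BY some_key").show()
```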

Why are you comparing a database with two data engines? Is that a fair comparison?

Did you only run the tests once? For more accurate results, shouldn’t you conduct at least three runs per test to calculate an average? Variability is common in these types of performance tests, so a single run is not reliable AT ALL.

Do you know how long it takes just to initialize each engine (Polars and Daft)?

Do you know which libraries Polars and Daft use to connect to S3? Are they the same or different?

How do these tools handle parallelization? Are there differences in how they distribute calculations across resources?

This article lacks depth and raises more questions than it answers.
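On the repeated-runs question above, a minimal repeat-and-average harness might look like the sketch below; `run_query` is a placeholder for whichever engine's aggregation is being timed:

```python
import statistics
import time

def benchmark(run_query, repeats=3):
    """Time `run_query` several times and return (mean, stdev) in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # the engine-specific query goes here
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

# Example (run_daft_query is a placeholder for the real test function):
# mean_s, spread_s = benchmark(lambda: run_daft_query(), repeats=3)
```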

You have literally posted nothing nor contributed to Substack at all, yet you criticize someone who takes the time to put this out there. What is stopping you from running this code yourself and answering these questions?

Disagree. The purpose of the test is to show how out-of-the-box packages perform on an active EC2 instance.

Maybe they have different init times, maybe they use different packages to connect to S3, and maybe they distribute calculations completely differently. It's neither here nor there, as there's little you can do about it. The point is how they perform.

I've also never seen much difference between runs on the same data unless there's an outside influence on compute; repeating a single job multiple times isn't going to change THAT much. Of course we're all allowed our opinion, and I find it interesting that this raised these questions for you. Have a good day.

Ok, it's perfectly fine to disagree.

You are slowly convincing me to take another look at Daft :). I do agree that the load_credentials thing for DuckDB and AWS is kind of weird, and it's nice how Daft and Polars just know how to find the session credentials. Bonus points that Daft could resolve the S3 destination without needing the additional library that Polars required. Maybe Daft is doing that under the hood, though.

Thanks for the complete code examples that you always include. Very easy to follow.
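A rough sketch of the difference being described above; the bucket path is made up, and it assumes s3fs was the extra library used on the Polars side (the article may have used a different one):

```python
import daft
import polars as pl
import s3fs  # assumed to be the extra dependency on the Polars side

# Daft resolves the s3:// URI itself and picks up the session credentials.
daft_df = daft.read_parquet("s3://my-bucket/data/part-0.parquet")

# Polars going through an explicit s3fs filesystem object instead.
fs = s3fs.S3FileSystem()
with fs.open("my-bucket/data/part-0.parquet", "rb") as f:
    polars_df = pl.read_parquet(f)
```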

Thanks Daniel, very interesting tests. I'm trying to do the same on my end to learn about Polars and Daft (I'm somewhat familiar with DuckDB), but something does not add up for me with the datahobbit tool. I downloaded and compiled it, and when I run datahobbit with only 5 million rows in Parquet, I get one file of 514 MB and one of 7 MB, which we can ignore for now. Extrapolating to the storage I would need on my Linux box for 10 billion rows: if 5 million rows produce one 514 MB file, then 10,000 million rows would mean 10000/5 = 2000 files of 514 MB, which is about 1 TB of storage. So, finally, my question: how is it that in your example you get a dataset that is only 16GB???
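For reference, the extrapolation in that question works out as follows, using only the numbers quoted in it:

```python
rows_sample = 5_000_000        # rows generated in the local datahobbit test
size_sample_mb = 514           # size of the main Parquet file it produced
rows_target = 10_000_000_000   # the article's 10 billion rows

files_needed = rows_target / rows_sample      # 10000 / 5 = 2000 files
total_mb = files_needed * size_sample_mb      # 2000 * 514 MB
print(f"{files_needed:.0f} files, ~{total_mb / 1_000_000:.2f} TB")  # 2000 files, ~1.03 TB
```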

Don't get me wrong. I liked the article; it was very interesting indeed. Maybe it's just that ... I was expecting much more?

Interesting to see these experiments, thanks :). Like others, I would be keen to see the best-practice approach with DuckDB (e.g. Ryan's suggestion above) and how it performs. I would have expected that it can handle out-of-core workloads if you specify a file database.
