An interesting article, thanks for it. However, I think one could expand on your thoughts a little bit. :)
First of all, I assume that edge and serverless computing are two big areas where DuckDB could shine. Being able to easily examine a dataset inside an AWS Lambda function or a similar technology can be very helpful and save quite a lot of money. Of course, this does not apply to all datasets, or to very big ones, but even fairly large datasets stored in Parquet format and optimized for the envisioned use case (in the sense of partitioning, sorting, etc.) can probably be handled quite efficiently by DuckDB. Something similar will hold for Polars, but that is another question. :)
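As a rough sketch of what I mean (the bucket, path, and column names are made up, credential setup is omitted, and this assumes the httpfs extension can be loaded in the Lambda environment):

```python
import duckdb

def handler(event, context):
    con = duckdb.connect()          # in-memory database, nothing persisted
    con.execute("INSTALL httpfs;")  # extension for reading over HTTP/S3
    con.execute("LOAD httpfs;")
    # Only the partitions and columns the query needs are actually fetched.
    rows = con.sql("""
        SELECT category, count(*) AS n, avg(value) AS avg_value
        FROM read_parquet('s3://my-bucket/events/*/*.parquet', hive_partitioning = true)
        GROUP BY category
    """).fetchall()
    return {"result": rows}
```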
Second, comparing DuckDB with Polars on a dataset stored as CSV is probably not very fair. DuckDB has its own storage format, which is optimized for analytical use cases, and I strongly assume that converting the CSV data into this internal format would increase query performance dramatically. The fact that Polars is so much faster is probably just because it does not parse all columns of the CSV files (which is definitely clever). Sure, one could argue that CSV is very widespread "in the real world"... but if one builds a system that needs to query data quickly, one will not use CSV files; one will transform the data into something more efficient (like Parquet or DuckDB's internal format), as sketched below.
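The one-time conversion could look roughly like this (file, table, and column names are just placeholders):

```python
import duckdb

con = duckdb.connect("analytics.duckdb")  # persistent database file on disk

# One-time conversion: parse the CSV once and store it as a native DuckDB table
# (columnar, compressed, with statistics for pruning).
con.execute("CREATE TABLE trips AS SELECT * FROM read_csv_auto('trips.csv')")

# Later queries read the internal format and never touch the CSV again.
con.sql("""
    SELECT passenger_count, avg(fare_amount) AS avg_fare
    FROM trips
    GROUP BY passenger_count
""").show()
```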
Also, your comment that using big machines is so old-school is true... but for a lot of use cases a sufficiently big machine is just enough and you don't need a cluster. Having a Spark cluster is cool and all... but for a LOT of use cases you just don't need more than 32, 64, or 128 cores. And data locality can bring huge benefits in terms of performance. If DuckDB is able to utilize all of these cores, then the performance will probably be sufficient for a lot of people.
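For what it's worth, DuckDB uses all available cores by default, and (if I read the docs correctly) the thread count can also be set explicitly, e.g.:

```python
import duckdb

con = duckdb.connect()
con.execute("SET threads TO 64")  # cap (or raise) the number of worker threads
```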
So, overall, I think there is a large number of use cases where DuckDB and similar technologies (like DataFusion, Polars, ...) can save a lot of complexity and money. But one needs to understand the pros and cons of the chosen solution.
I think that transforming a large dataset into the internal DuckDB format is a penalty on its own...
Beautifully written and concise. Thanks for this.
Thank you for your article!
I am a noob in this area, apologies if my question is stupid:
This blog article* from March 2022 shows rather different results, and I am wondering what the explanation is.
I am not familiar with the Polars syntax, so I tried to look up ‘pl.scan_csv’.
According to this Stack Overflow** answer, my understanding is that ‘pl.scan_csv’ supports some kind of caching, which can increase the effective reading speed significantly.
Moreover, DuckDB also works with Polars data frames***. I am curious what the result would be when combining ‘pl.scan_csv’ with the aggregation in DuckDB? Roughly, I mean something like the sketch below.
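(Column and file names here are placeholders, and I am not sure whether DuckDB can consume a lazy scan directly, so the frame is collected first.)

```python
import duckdb
import polars as pl

lazy = pl.scan_csv("measurements.csv")  # lazy: builds a query plan, reads nothing yet
df = lazy.collect()                     # materialize into a Polars DataFrame

# DuckDB can query Python variables holding (Arrow-backed) data frames by name.
result = duckdb.sql("""
    SELECT station, avg(temperature) AS avg_temp
    FROM df
    GROUP BY station
""").pl()                               # return the result as a Polars DataFrame
print(result)
```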
edit:
I came across another article**** that refers to the “lazy API” of Polars. In particular, the paragraph titled “Be lazy” explains the optimization of ‘scan_csv’ vs. ‘read_csv’ with a quite similar example.
This seems to be a more appropriate explanation for this case than “caching”.
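If I understand that paragraph correctly, the difference is roughly the following (file and column names are placeholders; ‘group_by’ is the method name in recent Polars versions, older ones call it ‘groupby’):

```python
import polars as pl

# Eager: read_csv parses the entire file into memory before the query runs.
eager = (
    pl.read_csv("measurements.csv")
    .group_by("station")
    .agg(pl.col("temperature").mean())
)

# Lazy: scan_csv only builds a query plan; nothing is read until collect().
# The optimizer can then push the projection down and parse just the two
# columns the query actually uses.
lazy = (
    pl.scan_csv("measurements.csv")
    .group_by("station")
    .agg(pl.col("temperature").mean())
    .collect()
)
```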
references:
* https://duckdb.org/2022/03/07/aggregate-hashtable.html
** https://stackoverflow.com/questions/73033580/why-polars-scan-csv-is-even-faster-than-disk-reading-speed
*** https://duckdb.org/docs/guides/python/polars.html
**** https://pola-rs.github.io/polars-book/user-guide/coming_from_pandas.html
The biggest competitors to DuckDB, imho, are actually services like AWS Athena. If you want to use SQL and the data is already in the cloud, then why should you choose DuckDB over Athena and the like?