Discussion about this post

seryj

An interesting article, thanks for it. However, I think one could expand on your thoughts a little bit. :)

First, I assume that edge and serverless computing are two big areas where DuckDB could shine. Being able to easily examine a dataset inside an AWS Lambda function or a similar technology can be very helpful and save quite a lot of money. Of course, this does not apply to every dataset, and certainly not to very big ones, but even fairly large datasets that are stored in Parquet and optimized for the envisioned use case (partitioning, sorting, etc.) can probably be handled quite efficiently by DuckDB. The same probably holds for Polars, but that is another question. :)

Second, comparing DuckDB with Polars on a dataset stored in CSV format is probably not very fair. DuckDB has its own storage format, which is optimized for analytical use cases, and I strongly assume that query performance will increase dramatically if one converts the CSV data into this internal format. Polars being so much faster is probably just due to the fact that it does not parse all columns of the CSV files (which is definitely clever). Sure, one could argue that CSV is very widespread "in the real world", but if one builds a system that needs to query the data quickly, one will not use CSV files but will transform the data into something more efficient (like Parquet or DuckDB's internal format).

Also, your comment that using big machines is old-school is true, but for a lot of use cases a sufficiently big machine is simply enough and you don't need a cluster. Having a Spark cluster is cool and all, but for a LOT of use cases you just don't need more than 32, 64, or 128 cores. Data locality can also bring huge performance benefits. And if DuckDB is able to utilize all these cores, then the performance will probably be sufficient for a lot of people.

So, overall, I think there is a large number of use cases where DuckDB and similar technologies (like DataFusion, Polars, ...) can save a lot of complexity and money. But one needs to understand the pros and cons of the solutions used.

Mike Kenneth

Beautifully written and concise. Thanks for this.
