In my never-ending quest to make r/dataengineering angry, of which I’ve been quite successful, I’m always on the lookout for Golden Calves and other Idols that I can smash to bits, and wait for the angry masses frothing at the mouth to hunt me down.
Sorry, not sorry, after years of doing this, I know what clickbait is and isn’t. Is that what I’m doing here, pitting DuckDB against Polars? Yes and no.
Sure, some zealots will come to me and say that Polars and DuckDB can live in harmony, that they have different use cases and can co-exist. Yeah, well, so can humans … expcept we don’t and Data Engineers are humans … so … let the battle begin.
I will attempt to be fair to both tools. But, I will be honest and say up front that I think Polars has won the battle or will win it soon enough. The battle that you may think doesn’t exist, yet still rages on in the shadows.
Why Polars vs DuckDB?
As someone who thinks about and practices the art of Data Engineering on a daily basis, I find the rise of new tooling extremely interesting and wonderful.
Most of the new innovations are born out of necessity, trying to solve some problem, and if it’s useful enough to the world at large, it gets popular.
What I also find interesting is when you have two technologies that are similar …
I like to see how they “respond to each other.”
How they adapt to the community after time goes by.
How picks up more stream and who falls behind?
If you look at DuckDB and Polars on the surface … they appear to be very similar. Take for instance this high-level view.
But are they really that similar? I think not.
I would venture to say that the only “similarity” that DuckDB and Polars share is SQL. Other than that, they are morphing into totally different tools.
DuckDB will always have a place in Data Engineering, I mean SQL is SQL. Anything SQL always does ok.
What does the “community” say?
When I’m testing the waters between two tools, whatever they may be, I try to take some time to ignore the talking heads and lick my finger and stick in the air. Taste what’s on the wind.
A healthy tool gets talked about in the open air. On Linkedin, Twitter, Reddit, wherever. You can’t hid a tool that’s being used.
Also, if it’s just marketing fluff being pushed … that never is the same thing as actual users talking about doing actual things with the tool(s).
What have I personally seen when I look at the community talk around DuckDB and Polars? I find people talking a lot about Polars, and how they are actually using it. With DuckDB I find a lot of marketing material, lots of benchmark and testing talk … but very little boots on the ground for day-to-day use.
I mean go look for yourself around Twitter and Linkedin.
All this content I’ve seen around DuckDB isn’t bad. But it’s mostly just talk. I don’t see a lot of people building new tools with DuckDB, talking about it, using it in Production, solving problems, talking more about it.
Maybe I just run in the wrong circles or have been looking under the wrong rocks. But I simply just don’t see and find a lot of Data Engineers using DuckDB … at least not as much as Polars.
If you search Polars, you will find actual things.
Every day users are talking about using Polars. Also, it doesn’t hurt to take a look at both GitHub repositories, not for stars per se (Polars has double the stars of DuckDB), but for the number of open issues.
Open issues indicate how active the community is, and the more people that use the tool, you are going to have a lot of issues being opened.
Polars has 1.3K compared to 223 on DuckDB. That should tell you something.
Polars is turning into a Data Engineering juggernaut.
You might accuse me of being biased, and I am, I’ve already written my fair share of Polars in my free time. I’ve also put it into Production at work, to replace some Spark jobs.
Here is some of the stuff I’ve done with Polars.
So why do I think the winner for the future of Data Engineering is Polars and not DuckDB? Simply put, because you can do a lot more “things” with Polars.
Polars is truly and end-to-end Data Engineering tool in every sense. It can, without problem, and with ease, be used to write entire end-to-end Data Pipelines.
I think you would struggle to do the same thing with DuckDB, I mean you can do what you want, but I don’t think that’ show DuckDB even sells itself.
Being fair to DuckDB.
I know that there have to be a few people ready to ring my little neck by now. That’s fine.
DuckDB is trying to do one thing very well. Being the next SQLite. It’s extremely fast, lightweight, and made to do one thing and one thing only. Fast SQL analytics in memory, in process.
One can argue that is completely different from Polars. And it is. Yet it is not.
What do Data Engineers look for in tool(s)?
Data Engineers are human and follow human patterns when picking tools, and as a whole, using tools that become widely adopted and popular (think Apache Spark for example).
Polars just offers enough of everything to become the number one choice.
If you are building Data Pipelines for a top-tier Engineering organization, with a variety of use cases and problems to solve … you’re simply going to look for a tool that can solve all your problems.
As someone who picks technologies and designs architecture … I need a reason to add complexity, to add another tool.
Polars is based on Rust and is extremely fast. It can work with larger-than-memory datasets. Am I really going to design more complexity into my pipelines and add steps to use DuckDB so I can gain some milliseconds on a few SQL queries?
No, I will not.
The End
Now don’t put words in my mouth. I have nothing against DuckDB, it’s a fine tool, and probably will be very useful to many folks. But it’s going to lose the tool war to Polars, by a lot.
I think DuckDB is a niche tool. Polars is a tool for the masses.
Counting *open* issues as community engagement is nonsense. The total number of issues, open and closed, is 7000 for polars and 5000 duckdb and the difference here can be explained from duckdb offering the GitHub "Discussions" to separate feature requests, questions, and discussions from actual bug reports. (By the way, this agrees with polars' open to closed issue ratio being 5x worse than duckdb's: a lot of their issues aren't really issues and as such can't be addressed/solved easily.)
Do you have any thoughts on using ibis+duckdb? That would fix the "tool to simplify" the use?