What's all the hype with DuckDB?
Introduction to DuckDB for Data Engineers. Performance and other thoughts.
Have you been as confused as I am about DuckDB? My goal with this article is to cut through all the crud and get to the bottom of the issue. I keep seeing the name DuckDB popping up in my feeds here and there in my varied and disjointed internet travels. There's nothing for it but to answer a few questions.
What is DuckDB?
How could a Data Engineer use it?
How is it being marketed and used today?
What is the performance like?
Overall thoughts on DuckDB.
It's hard not to become jaded by the number of tools being released (even the supposedly open-source ones, which are often just a way for some SaaS vendor to get more traction). It's truly overwhelming and hard to keep up.
Is a tool worth it? Will it be here and gone in a year? Will others start using the tool? Could it help streamline my data pipelines?
I see that as part of my service to you, the reader.
“I want to kick the tires on things for you, put in the time, research, and code so you don’t have to. Then give you my unfiltered opinion on that tool, in this case, DuckDB.”
What is DuckDB?
This answer turned out to be much easier to find than I thought it would be. Sometimes it’s tough to cut through all the layers of frosting before you get to the real cake at the core of the issue. I have to admit, DuckDB is positioning itself in an exciting place in the Data Stack, and fortunately, they are very straightforward about who they are, what they do well, and what they don’t do well.
“DuckDB is the OLAP (analytical) version of SQLite.”
I feel like I’m giving away the conclusion from the beginning, but that’s fine. We can spend the rest of the time unpacking why I think DuckDB is the marriage of SQLite and Pandas.
What does DuckDB have to say about itself?
“DuckDB is an in-process SQL OLAP database management system”
- duckdb
It's funny, after all the DuckDB hype, when digging into it I would say there isn't a single thing that is "earth-shattering." It's simply trying to do what SQLite does, a little better, and maybe steal away some of those Pandas users.
What are the main points and features of DuckDB?
Simple, clean install, with very little overhead.
Feature-rich with SQL, plus CSV and Parquet support.
In-memory option, high-speed, parallel processing.
Open-source.
“DuckDB is a transient relational database for in-process complex join, aggregation, and high-speed data transformations.”
How could a Data Engineer use DuckDB?
This is an interesting question to answer. I've been doing data work for well over a decade, and the use case for this sort of tool is more specialized and less common than you might think.
There are a few important things to remember before using a tool like DuckDB so you don't shoot yourself in the foot, so to speak. You know that phrase they use to describe DuckDB, "in-process"? Well, you should pay attention to it.
DuckDB isn’t meant to replace MySQL, Postgres, and the rest of those relational databases. In fact, they tell you DuckDB is very BAD at “High-volume transactional use cases,” and “Writing to a single database from multiple concurrent processes.”
What does that mean for Data Engineers?
“Never use DuckDB for anything that needs to last longer than the duration of your data pipeline.”
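To make that concrete, here is a minimal sketch of what "in-process and transient" looks like in practice. The file and column names are made up for illustration; the point is that the in-memory database lives and dies with your pipeline step unless you explicitly persist it to a file.

import duckdb

# An in-memory database: it exists only for the life of this connection.
conn = duckdb.connect()  # same as duckdb.connect(":memory:")

# Hypothetical pipeline step: crunch a local Parquet file and hand the
# aggregated result off to whatever comes next in the pipeline.
result = conn.execute("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM read_parquet('staging/orders.parquet')
    GROUP BY customer_id
""").fetchdf()

conn.close()  # every table created in the in-memory database is now gone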
You might be asking yourself, then what good is DuckDB to me? I've seen a few use cases that might benefit from this tool, although I think they are more limited than a DuckDB fan might admit.
R&D data exploration and research.
Very specific in-pipeline data transformation and analytics (as an appendix to the main pipeline process.)
???
I'm grasping at straws here. I'm not trying to be down on DuckDB, I'm just struggling to find real use cases that are more than R&D data exploration. I can see its uses if you have a bunch of data locally in different forms and are doing serious research and data exploration that requires lots of joins and aggregations. At a certain point, DuckDB probably makes sense.
If you really stretch the imagination, you might find some specific in-pipeline use cases where maybe you're crunching some ancillary data sets and you could use DuckDB to do that. I know it's supposed to be fast and easy to install, but is it really worth the cognitive overhead of adding yet another tool to your data pipeline? Not sure.
Maybe we can answer this question a little better by understanding how DuckDB has been marketed so far, finding some references to others who are using it, and getting a better idea of how it might actually be used for Data Engineering.
How is DuckDB being marketed and used today?
There are several sources I want to look at, including DuckDB's blog and their Twitter wall of testimonials, to see what's going on in the wild.
I reached out to my network on LinkedIn and Mastodon. The response was tepid. Honestly, the stuff on the DuckDB website wasn't all that specific, other than that it's faster than SQLite, which I'm sure it is.
There were also a few comments about using it for quick analysis of data, which makes sense.
I did read some ridiculous blog post from DuckDB about how it fits into the Modern Data Stack and could be run on a single large machine, replacing a lot of data pipelines and Data Warehouses! That seems like a very naive argument, especially since DuckDB, by its own admission, should not be written to by multiple concurrent processes and is in-process and transient.
Most data folks know that relying on a single large machine simply isn't a scalable design; it's a classic architecture mistake that will absolutely come back to bite you.
“I have yet to find a definitive example of how DuckDB is being used by anyone, let alone Data Engineers for anything other than data exploration and research. Maybe it’s out there, but I’m having trouble finding it.”
With all that talk of performance, which is repeated over and over again, this should be interesting.
Quick DuckDB technical overview.
Honestly, there isn't that much exciting to share about the technical use of DuckDB. One can simply pip install it for Python, for example, and reading datasets is as easy as calling something like read_csv_auto('input.csv') inside a SQL statement.
After that, you write DuckDB queries much like you would Postgres or MySQL. Join, aggregate, filter, and select, like you would any other SQL statement. I mean, that’s the whole point. You can create tables, copy data, and run queries, it’s simply another SQL manipulation tool. Read the docs if you like.
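For example, here is a minimal sketch of what that looks like (the file name input.csv is just a placeholder):

# pip install duckdb
import duckdb

conn = duckdb.connect()

# read_csv_auto infers the delimiter, header, and column types for you.
df = conn.execute(
    "SELECT * FROM read_csv_auto('input.csv') LIMIT 10"
).fetchdf()

print(df)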
What is the performance of DuckDB like?
The moment we've all been waiting for, a performance test. I'm not going to claim I run perfect and bulletproof performance tests, but I do my best: install things, run some normal tasks in a few tools, and compare the results.
This code is available on GitHub. We are going to use the Backblaze Hard Drive failures dataset. The dataset consists of three different sets of CSV files; unpacked, this is about 20.71 GB of data.
The data itself consists of hard drive model information, along with dates and whether the drive failed or not.
We will simply use DuckDB and Pandas to group the data by date and count the number of failures. So read the data, do a simple groupby and count.
The first thing I will note is that using DuckDB with Python is super easy, almost too easy.
import duckdb
from datetime import datetime

t1 = datetime.now()
conn = duckdb.connect()  # in-memory DuckDB database

# Read every CSV in the dataset and count failures per day.
conn.execute("""
    SELECT date, SUM(failure) AS failures
    FROM read_csv('data/*/*.csv', delim=',', header=True, ignore_errors=True,
        columns={
            'date': 'DATE',
            'serial_number': 'VARCHAR',
            'model': 'VARCHAR',
            'capacity_bytes': 'VARCHAR',
            'failure': 'INT'
        })
    GROUP BY date;
""")
print(conn.fetchall())

t2 = datetime.now()
print("duckdb took {x}".format(x=t2 - t1))
The results are in. duckdb took 0:01:09.602031
After posting preliminary results to a few social media networks, someone had a suggestion about using an experimental part of DuckDB, a parallel execution flag.
conn.execute("SET experimental_parallel_csv=TRUE")
This did indeed decrease the runtime, coming in at 0:00:49.865234, a great performance improvement, but still much slower than Polars as you will see.
Pandas Performance
You guessed it. Pandas cannot handle it, pukes, and uses up all the memory on my M1 MacBook. zsh: killed python3 pandas_main.py
import pandas as pd
import glob
from datetime import datetime

t1 = datetime.now()

# Read every CSV into memory, concatenate, then group by day and sum failures.
all_files = glob.glob("data/*/*.csv")
frames = []
for filename in all_files:
    frames.append(pd.read_csv(filename, header=0, parse_dates=["date"]))

df = pd.concat(frames, axis=0, ignore_index=True)
result = df.groupby(df["date"].dt.date)["failure"].sum()
print(result)

t2 = datetime.now()
print("pandas took {x}".format(x=t2 - t1))
This is no surprise as Pandas is well known to be a memory hog.
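To be fair, you could probably coax Pandas through this by aggregating one file at a time instead of concatenating everything into a single DataFrame, something like the rough sketch below (not what I benchmarked, just an illustration), but that's hand-rolled work the other tools don't ask of you.

import glob
import pandas as pd

# Aggregate each CSV on its own, keeping only the columns we need,
# so only one file's worth of data is in memory at a time.
partials = []
for filename in glob.glob("data/*/*.csv"):
    df = pd.read_csv(filename, usecols=["date", "failure"], parse_dates=["date"])
    partials.append(df.groupby(df["date"].dt.date)["failure"].sum())

# Combine the per-file partial sums into the final per-day failure counts.
failures_per_day = pd.concat(partials).groupby(level=0).sum()
print(failures_per_day)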
So if we can’t use Pandas to compare this to, what can we do? Let’s try Polars, a new tool in the Dataframe world.
Polars Performance
We are going to try the Rust-based Polars Dataframe tool, using the Python API.
import polars as pl
from datetime import datetime

d1 = datetime.now()

# Lazily scan every CSV; nothing is read until .collect() is called.
q = pl.scan_csv(
    "data/*/*.csv",
    parse_dates=True,
    dtypes={
        'date': pl.Date,
        'serial_number': pl.Utf8,
        'model': pl.Utf8,
        'capacity_bytes': pl.Utf8,
        'failure': pl.Int32,
    },
)

# Group by date, sum the failure flags, and execute the lazy query.
df = q.groupby(pl.col("date")).agg(pl.col("failure").sum()).collect()
print(df)

d2 = datetime.now()
print(d2 - d1)
I know DuckDB claims to be fast, but it's not fast enough. Polars blows DuckDB out of the water, finishing in 0:00:21.527071.
Overall thoughts on DuckDB.
My thoughts on DuckDB are complicated. In some respects, I was a little disappointed, mostly because I listened to the marketing fluff too much. I mean, it's just a nicer version of SQLite, which is helpful in some senses.
They talked so much about performance, but my tests showed it to be at best half as fast as a tool like Polars, with which you can accomplish all the same tasks. I'm just a little disappointed.
Yes, it's simple to install and use. But if its performance is mediocre, is it really worth my time as a Data Engineer to "find" a spot for it in my data stack? No, it isn't. In today's crowded market, with every vendor hawking and selling everything under the sun, you need to stand out from the crowd.
DuckDB does not stand out to me. I’m sure some folks find it useful for certain edge cases, and it might solve their pain. But it appears you can solve those same pains with familiar Dataframe tools that will run twice as fast.
An interesting article, thanks for it. However, I think one could expand your thoughts a little bit. :)
First of all, I assume that edge and serverless computing are two big areas where DuckDB could shine. Being able to easily examine a dataset inside an AWS Lambda or similar technology can be very helpful and save quite a lot of money. Of course, this does not apply to all datasets, or to very big ones, but even fairly large datasets stored in Parquet format and optimized (in the sense of partitioning, sorting, etc.) for the envisioned use case can probably be handled quite efficiently by DuckDB. The same will hold for Polars, but that's another question. :)
Second, comparing DuckDB with Polars on a dataset stored in CSV format is probably not very fair. DuckDB has its own data format, which is optimized for analytical use cases, and I strongly suspect that if one converts the CSV data into this internal format, the query performance will increase dramatically. The fact that Polars is so much faster is probably just because it does not interpret all columns from the CSV files (which is definitely clever). Sure, one could argue that "in the real world" CSV is very widespread... but if one builds a system which needs to quickly query the data, one will not use CSV files but transform the data into something more efficient (like Parquet or this internal DuckDB format).
Also, your comment that using big machines is old-school is true... but for a lot of use cases, a sufficiently big machine is just enough and you don't need a cluster. Having a Spark cluster is cool and all... but for a LOT of use cases, you just don't need more than 32, 64, or 128 cores. And data locality can bring huge benefits in terms of performance. If DuckDB is able to utilize all those cores, then the performance will probably be sufficient for a lot of people.
So, overall, I think there is a large number of use cases where DuckDB and similar technologies (like DataFusion, Polars, ...) can save a lot of complexity and money. But one needs to understand the pros and cons of the chosen solution.