If you know me, you know I like to stir the pot, the big, boiling, smoldering cauldron of a Data Tools pot. Yes, that’s the one: the blackened and burned pot from years of conjurers pouring myriad Modern Data Stack tools into it, tools that have since bubbled over and encrusted us all with the refuse of a thousand promises to be the cure-all for our ailments.
That pot.
What better way, methinks, to unleash the hounds upon myself while I run through the dark woods of the internet by moonlight than to pit Polars and DuckDB against each other?
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Check out their website below.
The Challenge.
What I want to do is compare Polars vs DuckDB in some interesting way and see what they are capable of. I’m not really concerned about syntax or features; they are both full-blown projects with plenty of users and active development going on.
What I am curious about is processing large datasets, at least “larger-than-memory” datasets, on small commodity hardware. This is where a tool has a real chance to steal market share from the likes of Spark.
So what I want to do is pit Polars vs DuckDB on, say, a 4GB memory Ubuntu-based machine in the cloud … read roughly 16GB of data from s3, process it, and write it back to s3.
Standard stuff. Standard Thunderdome.
The Data
We will use the free and open source Backblaze hard drive dataset, found here.
Using the aws cli, it looks like we have about 16GB of data.
aws s3 ls --profile confessions --summarize --human-readable --recursive s3://confessions-of-a-data-guy/harddrives
2024-02-24 15:19:06 0 Bytes harddrives/
....
2024-02-24 15:24:57 97.5 MiB harddrives/2023-09-30.csv
Total Objects: 183
Total Size: 15.9 GiB
Setting up the Compute.
Time to get stuff installed …
root@localhost:~# pip3 install polars
Collecting polars
Downloading polars-0.20.10-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26.8/26.8 MB 67.4 MB/s eta 0:00:00
Installing collected packages: polars
Successfully installed polars-0.20.10
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
root@localhost:~# pip3 install duckdb
Collecting duckdb
Downloading duckdb-0.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.8/17.8 MB 36.9 MB/s eta 0:00:00
Installing collected packages: duckdb
Successfully installed duckdb-0.10.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Let the battle begin. 4GB Ubuntu machine, 16GB of data.
Polars reading 16GB from s3 on a 4GB machine.
The first black mark against Polars is that the lazy `scan_csv` doesn’t work against s3 yet.
See this currently open GitHub issue; apparently others would like this as well, of course. The only thing left for me to do is resort to some old code I wrote this past summer, where I combined pyarrow with Polars to get the ability to read CSV files from s3.
import polars as pl
import pyarrow.dataset as ds
import s3fs
import os
from datetime import datetime

t1 = datetime.now()

# Credentials redacted; fill in your own keys.
os.environ['AWS_ACCESS_KEY_ID'] = ''
os.environ['AWS_SECRET_ACCESS_KEY'] = ''

bucket = 'confessions-of-a-data-guy'
path = 'harddrives'

fs = s3fs.S3FileSystem(key=os.environ['AWS_ACCESS_KEY_ID'],
                       secret=os.environ['AWS_SECRET_ACCESS_KEY'],
                       config_kwargs={'region_name': 'us-east-1'})

s3_endpoint = f"s3://{bucket}/{path}"

# Build a pyarrow dataset over every CSV file under the prefix.
myds = ds.dataset([y for y in fs.ls(s3_endpoint) if ".csv" in y],
                  filesystem=fs,
                  format="csv")

# Wrap the pyarrow dataset in a Polars LazyFrame and run SQL against it.
lazy_df = pl.scan_pyarrow_dataset(myds)
sql = pl.SQLContext()
sql.register("harddrives", lazy_df)
results = sql.execute("""
    SELECT date, SUM(failure) as failures
    FROM harddrives
    GROUP BY date
""")

# collect() pulls the (small) aggregated result into a DataFrame so it can be written.
with fs.open(f"s3://{bucket}/{path}/results/failures.csv", "w") as f:
    results.collect().write_csv(f)

t2 = datetime.now()
total = t2 - t1
print(f"It took {total} to run Polars")
The code isn’t very complicated; even with the addition of pyarrow to get the job done, it’s fairly straightforward. Note the use of scan_pyarrow_dataset, which allows the work to be done in a lazy manner, with a LazyFrame … until I call collect() at the end to pull the results into a DataFrame so I can write.
The LazyFrame has no option to write directly, so apparently you have to go to a DataFrame and manually collect() at that point. Seems like this is something it should infer.
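For what it’s worth, if the destination were a local file, a LazyFrame can be streamed straight to disk with sink_csv without ever collecting the full result; as far as I can tell that doesn’t help when the target is s3, hence the collect() plus write_csv dance above. A rough sketch only, assuming a hypothetical local copy of the data and a local output path:
import polars as pl

# Rough sketch, NOT the pipeline above: assumes the CSVs and the output
# live on local disk, where LazyFrame.sink_csv can stream the result out
# without materializing a full DataFrame first.
lazy_df = pl.scan_csv("harddrives/*.csv")  # hypothetical local copy of the data
(
    lazy_df.group_by("date")
    .agg(pl.col("failure").sum().alias("failures"))
    .sink_csv("failures.csv")  # streaming write, no collect()
)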
I find it extremely annoying at this point that Polars has no built-in feature to scan remote CSV files sitting in s3 cloud storage … for a tool that’s supposed to beat out Pandas and be the next hot thing … come on.
Results …
root@localhost:~# python3 polars-v-duckdb.py
It took 0:03:43.186879 to run Polars
I can’t complain, though: 3 minutes and 43 seconds to process 16GB of data on s3 with a 4GB machine.
That’s powerful and promising. The results looked fine.
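For the curious, the sanity check was nothing fancy: just pull the small results file back down and eyeball it. A minimal sketch, assuming the same bucket and credentials used in the job above:
import polars as pl
import s3fs

# Minimal sanity check: the aggregated results file is tiny, so just read it back.
# Assumes the same credentials and bucket used in the Polars job above.
fs = s3fs.S3FileSystem()
with fs.open("s3://confessions-of-a-data-guy/harddrives/results/failures.csv", "rb") as f:
    print(pl.read_csv(f))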
DuckDB reading 16GB from s3 on a 4GB machine.
I’m looking forward to this one … hopefully DuckDB will be less of a headache for reading and writing data from s3. Again, it’s such a common use case these days. Polars not being able to support that out of the box for CSV files is still in the back of my mind.
I digress.
Let’s do this.
import duckdb
from datetime import datetime


def main():
    t1 = datetime.now()

    # Spill-to-disk location for anything that doesn't fit in memory.
    duckdb.sql("SET temp_directory = 'temp';")

    duckdb.sql("""
        SET memory_limit = '3GB';
        INSTALL httpfs;
        LOAD httpfs;
        SET s3_region='us-east-1';
        SET s3_access_key_id='';
        SET s3_secret_access_key='';

        CREATE VIEW metrics AS
        SELECT date, SUM(failure) as failures
        FROM read_csv_auto('s3://confessions-of-a-data-guy/harddrives/*.csv', ignore_errors = true, union_by_name = true)
        GROUP BY date;
    """)

    # Write the aggregated view back out to s3.
    duckdb.sql("""
        COPY metrics TO 's3://confessions-of-a-data-guy/harddrives/results.csv';
    """)

    t2 = datetime.now()
    total = t2 - t1
    print(f"It took {total} to run DuckDB")


if __name__ == '__main__':
    main()
I have to say the DuckDB code is much simpler. Not much to worry about there.
Uh oh.
root@localhost:~# python3 duckdb-test.py
Traceback (most recent call last):
File "/root/duckdb-test.py", line 25, in <module>
main()
File "/root/duckdb-test.py", line 18, in main
duckdb.sql("""
duckdb.duckdb.InvalidInputException: Invalid Input Error: Mismatch between the schema of different files
Bugger … I knew this was going too well. But leave it to DuckDB … it looks like they have a solution for that: the ignore_errors option. Note I added `ignore_errors = true` back into the code above.
Lord willing and the creek don’t rise, that will fix the problem.
Dang it.
root@localhost:~# python3 duckdb-test.py
Traceback (most recent call last):
File "/root/duckdb-test.py", line 25, in <module>
main()
File "/root/duckdb-test.py", line 18, in main
duckdb.sql("""
duckdb.duckdb.InvalidInputException: Invalid Input Error: Mismatch between the schema of different files
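Since the error is still complaining about schemas differing between files, one way to see where they disagree is to DESCRIBE a couple of the files individually and compare the inferred columns. A hedged sketch only; the second filename is a guess at an older file, not a known offender, and you’d need the same s3 credentials SET as in the main script:
import duckdb

# Hedged diagnostic sketch: DESCRIBE a couple of files and compare the
# inferred schemas. The second filename is a guess, not a known offender.
# (s3_access_key_id / s3_secret_access_key would need to be SET as above.)
duckdb.sql("INSTALL httpfs; LOAD httpfs; SET s3_region='us-east-1';")
for f in ["s3://confessions-of-a-data-guy/harddrives/2023-09-30.csv",
          "s3://confessions-of-a-data-guy/harddrives/2023-01-01.csv"]:
    print(f)
    duckdb.sql(f"DESCRIBE SELECT * FROM read_csv_auto('{f}')").show()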
Well, maybe this option will work … `union_by_name = true`
That worked, but what I greatly feared has come upon me. That which I dread. Like Job.
root@localhost:~# python3 duckdb-test.py
Traceback (most recent call last):
File "/root/duckdb-test.py", line 25, in <module>
main()
File "/root/duckdb-test.py", line 6, in main
duckdb.sql("""
duckdb.duckdb.OutOfMemoryException: Out of Memory Error: failed to allocate data of size 32.0 KiB (3.0 GiB/3.0 GiB used)
And there was great weeping and moaning.
According to DuckDB’s website, it’s capable of spilling to disk if it runs out of memory.
DuckDB is capable of operating both as an in-memory and as a disk-based database system. In the latter case, it can spill to disk to process larger-than-memory workloads (a.k.a. out-of-core processing). In these cases, a fast disk is highly beneficial. However, if the workload fits in memory, the disk speed only has a limited effect on performance.
After doing some reading, it appears we can enable this ability to spill to disk with something like `SET temp_directory = '/path/to/temp_dir.tmp/'`; otherwise, DuckDB will NOT spill to disk automatically unless it’s configured to use a “database file.”
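For reference, here is roughly what I understood that configuration to look like: point DuckDB at an on-disk database file plus an explicit temp_directory and memory_limit. A sketch only; the paths and the 3GB number are my own choices, not anything blessed by the docs:
import duckdb

# Sketch of the spill-to-disk setup described above; paths and limits are
# my own picks.
con = duckdb.connect("scratch.duckdb")                 # file-backed database, not in-memory
con.sql("SET temp_directory = '/tmp/duckdb_spill/';")  # where spilled data should land
con.sql("SET memory_limit = '3GB';")                   # leave some headroom for the OS
con.sql("INSTALL httpfs; LOAD httpfs;")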
I tried it a few times, and moved that command around all over, with the query, outside the query, didn’t matter. OOM every single time.
Traceback (most recent call last):
File "/root/duckdb-test.py", line 26, in <module>
main()
File "/root/duckdb-test.py", line 7, in main
duckdb.sql("""
duckdb.duckdb.OutOfMemoryException: Out of Memory Error: failed to allocate data of size 32.0 KiB (3.0 GiB/3.0 GiB used)
I read some more StackOverflow complaints about this issue; it seems like maybe setting a memory limit lower than whatever resources you are working with will force DuckDB to stop at some point and then spill.
People were still opening issues like this on GitHub as of December 2023, but the issues simply get closed. And a similar issue was closed recently because of “reproducibility.” Classic.
Clearly the DuckDB folks are sweeping this problem under the carpet.
I set the memory limit to 3GB, keeping some for overhead. That didn’t work either. Still OOM.
root@localhost:~# python3 duckdb-test.py
Traceback (most recent call last):
File "/root/duckdb-test.py", line 27, in <module>
main()
File "/root/duckdb-test.py", line 7, in main
duckdb.sql("""
duckdb.duckdb.OutOfMemoryException: Out of Memory Error: could not allocate block of size 30.5 MiB (2.7 GiB/2.7 GiB used)
I’m pretty much out of options at this point. I’m no DuckDB expert, and I’ve tried everything listed in the docs along with other ideas from around the internet; I’ve pushed it about as far as I’m willing to.
It doesn’t seem to be a rare error, either; from what I Googled, people randomly run into this problem reading all sorts of large datasets, from Parquet to CSV. Go look for yourself.
Thinking about DuckDB and Polars for Larger Than Memory Workloads.
The funny thing is both systems have their quirks. Goes to prove that no tool is perfect, and thinking any one tool is the end-all-be-all is kinda silly.
It’s clear though that Polars is your best bet to use commodity hardware to cheaply work on larger-than-memory datasets. DuckDB isn’t there yet obviously. Unless maybe you get lucky.
Tools are like people, you never know what you’re going to get until they are under pressure.
My other hesitation with DuckDB now is that it will apparently require large and expensive Cloud Compute to work on any dataset of real size. I’m not sure if you’ve checked pricing recently, but large-memory EC2 instances on AWS, for example, are not cheap.
Polars is the obvious choice here. We had to jump a little fence by using PyArrow to scan the s3 CSV datasets, but it was pretty simple. Thinking about running pipelines cheaply and effectively, we want the ability to use cheaper machines on real-life datasets.
I’m sure I will get angry comments and emails from all the DuckDB zealots but I don’t care. I’m just trying out tools like any other average engineer would do, Googling problems and applying solutions from the documentation and other ideas.
Yet another disappointing experience with DuckDB.