This is an interesting question indeed, is it not? What to use, what to use? Both DuckDB and Polars seem to be flying high at the moment, the new cool kids on the block. Everyone is talking about them, and 1% of people are actually using them. Typical.
But, the question remains. What should we be learning and using, DuckDB or Polars?
A person would think that, on the surface, these tools do the same thing and are for the same target audience. Just data tools, right? Hmmm … I think not.
I always like to take the flavor-of-the-day marketing with a grain of salt; instead, I prefer to use the tools, watch others who are using the tools, and apply some general common sense.
So no, I don’t think these tools are the same thing, yet they do have a LOT of overlap.
What is DuckDB?
We need to ask the question in case you're a Neanderthal living in a cave.
“DuckDB is a fast in-process analytical database. DuckDB supports a feature-rich SQL dialect complemented with deep integrations into client APIs.”
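In practice, "in-process" means you just import it and start querying, no server, no setup. A minimal sketch (the query here is made up, just to show the shape of it):

import duckdb

# DuckDB runs inside your Python process: no server to stand up, no config files.
con = duckdb.connect()  # in-memory database by default
print(con.execute("SELECT 42 AS answer, 'in-process' AS how").fetchall())
# [(42, 'in-process')]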
What is Polars?
And for Polars?
“DataFrames for the new era. Polars is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.”
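And here is what that "expressive, typed API" looks like in a handful of lines (the data is made up):

import polars as pl

# A typed DataFrame and an expression-based aggregation.
df = pl.DataFrame({"bike": ["electric", "classic", "electric"], "minutes": [12, 7, 31]})
print(df.group_by("bike").agg(pl.col("minutes").sum().alias("total_minutes")))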
So what next?
Ok, so right off the bat we can see a slight divergence from each other in the way they even market themselves to the general public. This should tell us that although they might be similar tools in our minds, they are sorta pointing themselves in different directions.
What you should use.
Even though they are similar tools with similar use cases, they are built for different target markets, even if their makers wouldn’t say so.
DuckDB is SQL-centric and built for those people who work with SQL on a daily basis.
Polars is the new Pandas. It’s for the Data Science/ML and DataFrame-crazy folk who program on a daily basis.
Yeah, you can write SQL with the Polars SQLContext, but if you’re building a SQL thingy, you should probably reach for DuckDB … you will be happier in the end.
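For the curious, here is roughly what that SQLContext escape hatch looks like (a minimal sketch; the DataFrame is made up):

import polars as pl

df = pl.DataFrame({"ride_id": [1, 2, 3], "bike": ["electric", "classic", "electric"]})

# Register the DataFrame under a name and query it with SQL.
ctx = pl.SQLContext(rides=df)
print(ctx.execute("SELECT bike, COUNT(ride_id) AS rides FROM rides GROUP BY bike", eager=True))

It works, but it feels like a bolt-on. DuckDB treats SQL as the front door.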
If you’re a programmer who lives in Python, works in Databricks a lot, and works around Data Science and ML, Polars is going to feel more familiar and better in your fingers.
For example, check out this simple GitHub repo I wrote a while ago, just to show a basic data pipeline with Polars and DuckDB.
One of these pipelines is just going to feel more familiar and better to you. Kinda like vanilla or chocolate ice cream, which one do you reach for?
import polars as pl
import pyarrow.dataset as ds
from datetime import datetime
import s3fs

def main():
    t1 = datetime.now()
    bucket = "confessions-of-a-data-guy"
    key = ""      # your AWS access key
    secret = ""   # your AWS secret key
    fs = s3fs.S3FileSystem(key=key,
                           secret=secret,
                           config_kwargs={'region_name': 'us-east-1'})
    s3_endpoint = f"s3://{bucket}/"
    # Build a PyArrow dataset over just the CSV files in the bucket.
    myds = ds.dataset([y for y in fs.ls(s3_endpoint) if ".csv" in y],
                      filesystem=fs,
                      format="csv")
    # Scan lazily with Polars; nothing runs until collect().
    lazy_df = pl.scan_pyarrow_dataset(myds)
    lazy_df = lazy_df.group_by("started_at").agg(pl.col("ride_id").count().alias("ride_id_count"))
    # Execute the query and write the result back to S3 as Parquet.
    with fs.open("s3://confessions-of-a-data-guy/harddrives/metrics-polars", "wb") as f:
        lazy_df.collect().write_parquet(f)
    t2 = datetime.now()
    print(f"Time taken: {t2 - t1}")

if __name__ == "__main__":
    main()
or …
import duckdb
from datetime import datetime

t1 = datetime.now()
cursor = duckdb.connect()
# httpfs lets DuckDB read and write S3 directly; .df() hands back a pandas DataFrame.
df = cursor.execute("""
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region='us-east-1';
    SET s3_access_key_id='';
    SET s3_secret_access_key='';
    CREATE TABLE data AS
        SELECT CAST(started_at AS DATE) AS started_at_date,
               count(ride_id) AS ride_id_count
        FROM read_csv_auto('s3://confessions-of-a-data-guy/*.csv')
        GROUP BY started_at_date;
    COPY data TO 's3://confessions-of-a-data-guy/ducky-results.parquet';
""").df()
t2 = datetime.now()
print(f"Time taken: {t2 - t1}")
I mean they get you to the same place. One of them probably just fits in your architecture and background a little more than the other, and that’s ok.
You should use whichever one you want. I wouldn’t say no to someone using one or the other, or both, in production, in a data pipeline.
My guess is that folks could find both DuckDB and Polars fitting perfectly into different parts of their pipelines. Most likely saving some money too. Tools like these are the future.
Spark is here to stay for many years to come, but there does seem to be a new trend in town.
The single-node, fast data processing tools are popping up like hotcakes. I myself have replaced Spark on Databricks with Polars on Airflow to save money. It works great.
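Here is roughly what that swap looks like as an Airflow task (a sketch only, assuming Airflow 2.x; the DAG name, schedule, and paths are made up, and you would wire in your own credentials):

import polars as pl
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def polars_metrics():
    @task
    def aggregate():
        # One beefy single node instead of a Spark cluster.
        (
            pl.scan_parquet("s3://my-bucket/rides/*.parquet")  # hypothetical path
            .group_by("started_at")
            .agg(pl.col("ride_id").count().alias("ride_id_count"))
            .collect()
            .write_parquet("/tmp/ride_metrics.parquet")  # hypothetical output
        )
    aggregate()

polars_metrics()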
I imagine there are a fair number of Snowflake users who could migrate a portion of their workloads to DuckDB on a single node and save some money.
DuckDB in the wild.
I mean, this is an interesting question: who is using DuckDB in the wild, and for what?
According to Reddit users, those zombies and infernal rabble, DuckDB is used for what you’d think it would be: as a replacement for Pandas and for other data wrangling.
It is hard to overstate the importance of DuckDB in the future of Data Engineering. People are using it locally for development and data processing, people are finding places in production pipelines to swap in DuckDB, and people are building tools with it.
DuckDB is one of those quiet tools that sneaks up on you, and before you know it, it’s going to be everywhere. Its ease of installation and simple usage are the keys to its wide adoption. You can’t argue with something that simply works.
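Seriously, the whole setup is pip install duckdb and a couple of lines of code (the file name here is hypothetical):

import duckdb

# Query a local file directly: no server, no schema, no loading step.
duckdb.sql("SELECT count(*) FROM 'some_local_file.csv'").show()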
Polars in the wild.
The funny thing is, unsurprisingly, that it’s hard to find the same sort of hype in the wild about Polars. This, for some of you, might seem strange, but it’s really not.
DuckDB is SQL-centric and will take center stage with the average Data Engineer for that reason. Polars will have a harder journey.
People go to Polars because it’s nicer than Pandas, is cleaner, and can even replace Spark in some instances. There just isn’t a large base of people doing this … yet.
Also, the truth is, Polars is the perfect tool for larger datasets, but again, many data users simply don’t work with large datasets.
I personally pick Polars over DuckDB, but that’s because I prefer programming over SQL and I work with very large datasets. That is not the major use case for most.
DuckDB and Polars are here to stay.
Both of these tools have made inroads into the Data Engineering community and will not be leaving anytime soon. If you are a SQL person, pick up DuckDB; if you enjoy programming more and work with Spark, you should learn Polars.
I have found that DuckDB still has a few issues that can cause it to blow up unexpectedly, such as loading a file with about the same number of rows as its type-sampling window, or columns that are all null/None.
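If you hit the type-sniffing problem, one knob worth knowing is the CSV sample size, which you can force to cover the whole file (a sketch; 'flaky.csv' is made up):

import duckdb

# sample_size=-1 makes DuckDB infer column types from the entire file
# instead of a sample, which sidesteps some of those blowups.
df = duckdb.sql("SELECT * FROM read_csv_auto('flaky.csv', sample_size=-1)").df()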
Polars has never failed me, and it is the fastest option for the processing I do, taking around half the time of other approaches (Pandas, Arrow, DuckDB) for the same tasks.
For bonus points, try Ibis with Polars as the backend rather than DuckDB. Bliss!
The Ibis syntax is also much cleaner, closer to dplyr.
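If you want to try that combo, the setup is roughly this (a sketch, assuming a recent ibis-framework installed with the Polars extra, e.g. pip install "ibis-framework[polars]"; the data is made up):

import ibis
import polars as pl

ibis.set_backend("polars")  # run Ibis expressions on the Polars engine

df = pl.DataFrame({"bike": ["electric", "classic", "electric"], "minutes": [12, 7, 31]})
t = ibis.memtable(df)  # wrap the Polars frame as an Ibis table

# dplyr-ish chained expressions, executed by Polars under the hood.
print(t.group_by("bike").agg(total=t.minutes.sum()).execute())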