DuckDB and MotherDuck now seem to be a common part of the Data Engineering commentary. It’s probably here to stay. Say what you will about the MotherDuck thing and what it can and cannot replace; the future will decide.
Knowing that it’s here to stay has me thinking. It’s obvious that DuckDB scratches an itch Data Engineering has had for a long time. It’s fast, it’s SQL, it’s easy to use. But I also know that people use tools however they want, and where DuckDB ends up will probably look different from what most people expect before it’s all settled.
What’s been on my mind about DuckDB? Is it just a new single-node Analytics Engine? Is it an in-application data store? Can it be used for just normal Data Transformations and Pipelines? All of the above, one of the above?
Just because you can do something doesn’t mean you should. Let’s talk about it.
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Check out their website below.
DuckDB for Data Engineering
Ok. DuckDB.
DuckDB is an in-process SQL OLAP database management system. All the benefits of a database, none of the hassle.
- DuckDB
It isn’t hard to figure out why DuckDB is so popular. SQL is popular and everyone can use it. This is where the simple genius of DuckDB comes into play.
Let’s say you’re a Data Engineering team full of people who are SQL-heavy in experience. Most of your data resides in some RDBMS like SQL Server, Postgres, Oracle, MySQL, or whatever. Also, you’re trying to move your tech stack forward and build better data pipelines and transforms like a good little hobbit.
What tool do you use? You probably have a few options.
Some canned semi-GUI tools like SSIS or Talend or some other junk.
Dive into the deep end of custom pipelines with Python or PySpark, or something else.
Neither of these options is the best fit for a lot of teams. They want to graduate from stuff like SSIS because of its inherent inflexibility, but they might not be ready, or have the technical chops, to dive straight into PySpark, for example.
Enter DuckDB.
Simple. Easy to Use. SQL. Database without the database.
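To make “database without the database” concrete, here is a minimal sketch of what using it looks like (just the standard DuckDB Python API, nothing fancy):

```python
import duckdb  # pip install duckdb

# No server, no setup: the database lives inside your Python process.
con = duckdb.connect()  # in-memory by default; pass a file path to persist

con.sql("CREATE TABLE numbers AS SELECT range AS n FROM range(10)")
con.sql("SELECT sum(n) AS total FROM numbers").show()
```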
SQL for Analytics vs Transformations.
We’ve probably come full circle in the last decade. From too much SQL, to more Dataframe-centric code, and back to SQL again. The infinite loop.
When it comes to DuckDB, the so-called OLAP database to rule them all, the question is: is it just for analytics (OLAP), or is it an end-to-end Data Engineering tool? Can it, and should it, be used for data transformations all along the way?
Thinking through it.
What I wonder about while lying awake in bed at night is whether it’s worth inserting DuckDB into your Data Engineering pipelines, and how much of the work, and where, it takes over from other tools.
No one using Snowflake, BigQuery, or Spark is going to replace those things with DuckDB (unless they shouldn’t have been using those tools).
People using Pandas, Python, Postgres/MySQL, Polars and other non-distributed systems could probably insert DuckDB as a replacement.
DuckDB is in-process; by default it runs in memory or against a single local file, not a shared, long-term data store.
That’s helpful to start seeing where DuckDB can and might go. There are a lot of companies who want to use something from the Modern Data Stack, but simply don’t have the data size to justify things like Snowflake/Databricks/BigQuery.
What else makes DuckDB so attractive?
The thing is, you can simply `pip` install DuckDB and start using it in your pipelines immediately to transform and aggregate data … with SQL. People like easy, and people like SQL.
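For example, here is a hedged sketch of the kind of thing you can do minutes after installing it (the file and column names are made up for illustration; `.df()` assumes Pandas is installed):

```python
import duckdb  # pip install duckdb

# Query a local file directly with SQL: no load step, no server, no cluster.
df = duckdb.sql("""
    SELECT customer_id,
           count(*)         AS orders,
           sum(order_total) AS revenue
    FROM read_csv_auto('orders.csv')
    GROUP BY customer_id
    ORDER BY revenue DESC
""").df()  # hand the result straight to Pandas

print(df.head())
```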
DuckDB for Data Pipelines
Here are some of my worries with DuckDB. They aren’t specific to DuckDB per se; they apply to SQL-heavy pipelines and transformations in general.
How hard it is to make small and reusable transformations.
Unit testability (or the lack of it).
Knowing you can’t scale past a single machine easily.
SQL-only options for data transformation.
I’m simply talking about the ability to apply best practices. This doesn’t have to do with a particular tool as much as with the culture of a particular Engineering team and whether it does development correctly. I’ve found that purely SQL-based teams struggle in this area.
This is interesting. Recently I wrote two simple Data Pipelines, one in DuckDB and one in Polars.
https://github.com/danielbeach/TheBearVsTheDuck
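The actual code lives in the repo above; purely for flavor, here is a rough sketch of the general shape of a DuckDB pipeline like that (hypothetical file and column names, not the code from the repo):

```python
import duckdb

# Read raw trip data, filter and aggregate with SQL, write the result out as Parquet.
duckdb.sql("""
    COPY (
        SELECT pickup_date,
               count(*)           AS trips,
               avg(trip_distance) AS avg_distance
        FROM read_csv_auto('trips.csv')
        WHERE trip_distance > 0
        GROUP BY pickup_date
    ) TO 'trips_by_day.parquet' (FORMAT PARQUET)
""")
```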
And then Polars.
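A comparable sketch (again hypothetical, using the recent Polars API):

```python
import polars as pl

# Same shape of pipeline: scan, filter, aggregate, write.
(
    pl.scan_csv("trips.csv")
    .filter(pl.col("trip_distance") > 0)
    .group_by("pickup_date")
    .agg(
        pl.len().alias("trips"),
        pl.col("trip_distance").mean().alias("avg_distance"),
    )
    .collect()
    .write_parquet("trips_by_day.parquet")
)
```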
I mean, is there really any big difference to the naked eye between these two? No.
Are there differences … Yes.
For me, I’m not so sure how I would go about breaking this kind of work into small, testable functions and methods that chain transformations together into a single reusable and approachable pipeline.
I don’t want to just write extra-large SQL statements that cram multiple pipeline steps together; that would simply take us back to the days of old.
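If I were forced to try, it would probably look something like the sketch below: each step is a small function that takes and returns a DuckDB relation, so it can be unit tested against a tiny in-memory table. This is just my own sketch of the idea, with made-up table and column names, not an established pattern from the DuckDB docs.

```python
import duckdb


def load_orders(con: duckdb.DuckDBPyConnection, path: str) -> duckdb.DuckDBPyRelation:
    # One small step: read the raw file.
    return con.sql(f"SELECT * FROM read_csv_auto('{path}')")


def filter_valid(orders: duckdb.DuckDBPyRelation) -> duckdb.DuckDBPyRelation:
    # One small step: drop bad rows.
    return orders.filter("order_total > 0")


def daily_revenue(orders: duckdb.DuckDBPyRelation) -> duckdb.DuckDBPyRelation:
    # One small step: aggregate.
    return orders.aggregate("order_date, sum(order_total) AS revenue", "order_date")


def test_filter_valid():
    # Each piece can be tested against a tiny hand-built relation.
    con = duckdb.connect()
    fake = con.sql(
        "SELECT * FROM (VALUES (DATE '2023-01-01', 10.0), "
        "(DATE '2023-01-02', -5.0)) AS t(order_date, order_total)"
    )
    assert filter_valid(fake).aggregate("count(*) AS n").fetchone()[0] == 1
```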
What can DuckDB pipelines look like in the wild?
I thought I would go searching for some larger and more realistic DuckDB pipelines in the wild. I ran across a blog post and some code benchmarking dbt + DuckDB.
Which led me to this repo https://github.com/datamindedbe/tpcds-dbt-duckdb
This led me to some sample SQL that people would write and run https://github.com/datamindedbe/tpcds-dbt-duckdb/blob/main/dbt/dbt_duckdb_tpcds/models/normal/tpcds_q02.sql
This is where my beef comes into play. How would you like to maintain more complex, and many more, pipelines built like this? I wouldn’t. This harkens back to the old days when everyone used stored procs and countless massive SQL files to do everything (I lived that life for years).
There was a reason people rebelled against that travesty.
Inflexible
Becomes extremely over-complicated quickly.
Logic becomes disparate and impossible to manage.
Debugging and troubleshooting time and difficulty go through the roof.
I get it, this has a lot to do with Engineering culture and practice. We have to draw a line in the sand and say “We will not write spaghetti SQL and call it a pipeline.”
But we are human; we take the easy path, we try to get stuff done quickly, and we have deadlines. Can we say no when all our friends are doing it?
I’m not so sure.
Why DuckDB is still awesome.
Even with all that said (and I feel like I had to say it), I still think DuckDB is awesome and a much-needed tool.
It’s dead simple to use.
It’s lightweight on our data infrastructure.
It’s SQL.
It’s fast.
It has tons of integrations with other tools and file systems (see the quick sketch after this list).
It makes data exploration and analysis dead simple and easy.
It can replace other heavy and slower data transformation technologies.
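On the integrations point, a quick hedged sketch: DuckDB can scan Parquet locally or over S3 (via its httpfs extension) and hand the results straight to Pandas, Arrow, or Polars. The bucket, file, and column names here are made up, and S3 credentials/config are omitted:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")  # one-time install of the extension for s3:// paths
con.sql("LOAD httpfs")     # credentials/config for your bucket not shown here

rel = con.sql("""
    SELECT store_id, sum(sales) AS total_sales
    FROM read_parquet('s3://my-bucket/sales/*.parquet')
    GROUP BY store_id
""")

pandas_df = rel.df()     # to Pandas
arrow_tbl = rel.arrow()  # to Apache Arrow
polars_df = rel.pl()     # to Polars
```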
I think DuckDB is the perfect tool to enhance and replace SOME parts of the Data Engineering pipeline tech stack. It’s SQL. Fast. Lightweight.
It can make our pipelines better and less complex. Just like with any tool, don’t go overboard: go slow, be methodical, and use best practices.
I think dbt is actually the answer to the issue of SQL modularity and testing. The project you refer to just isn’t following best practices: https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging
That said, I personally aim for as much SQL as possible for the basics, then do the rest with whatever tools the team chooses (probably Python). The reason is that while SQL can be a painful monstrosity to look at, it’s declarative, which means you get performance boosts any time you upgrade your database without needing to update any SQL (yes, there are regressions, but those eventually get fixed).
The problem with maintaining your own code is that it doesn’t get better for free; you’re relying on your team to make sure it works and, as business logic changes, to not miss any assumptions in imperative logic many hands have touched. Sure, that’s true of databases too, but multiple companies have collaborated on and run these systems in production; they’ll run into and resolve issues faster than you’ll even encounter them, much less fix them.
It’s like owning your own house versus renting. Yeah you get to own your own property, but if something breaks, you’re now the landlord that must pay to fix it.
So maximize SQL, make it modular and testable with dbt, and write your own code only when necessary.