In my never-ending quest to pound DuckDB into the dirt (don’t take it personally, I do it for the good of all Data Engineers) I realized I’d forgotten something oh-so-important. I mean there are few more tedious things that we have to do day in and day out than date and datetime manipulation.
So, today we will do the boring.
My hope is that it will be boring, I dare say that would be a good thing. When it comes to this sort of thing, the piddling and meddling with timestamps and the like, you don’t want it to get exciting.
The importance of the small stuff.
One thing you will note, if you are an astute observer of my general modus operandi, is that when it comes to newerish tools, I typically do the basic stuff first.
I simply build a normal pipeline. I read data from s3, munged the data around, and then write the data to s3. I mean what is more boring and ordinary than that?
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Check out their website below.
The point is, you would be surprised how many times you find holes in the bottom of the bucket when doing this stuff. I find it a breath of fresh air amidst the muck and mire we call data content that has been infiltrated by those marketing pirates who seem to have swallowed us all whole.
That is why we are going to do something boring, like play with date and datetime manipulation in DuckDB. We gonna squeeze the ducky till it makes a noise.
DuckDB date and datetime buggering
It’s hard to know where to start, but I want to make it as realistic as possible. There are a few general date and datetime manipulations that cover 80% of what actually needs to be done most of the time.
Probably one of the most common manipulations that Data Engineers do for datetime manipulation is turning STRING dates into real dates. Think about getting some date/datetime column in a CSV file and needing to cast it to something useful.
Luckily and surprisingly DuckDB has something that all Python people will recognize … strftime and strptime .
We will be using Divvy Bike Trips open source dataset to do our playing around with.
strftime and strptime
This should be straightforward, let’s convert the started_at column which is coming from a CSV file to an actual timestamp, for example, `2024-06-11 17:20:06.289`.
It’s also important to note that DuckDB has a CSV autoloader that can do a great job sensing formats, including complex datetime formats, like in this example.
Above you can see auto-reading the CSV file and the timestamps for started_at were interpreted correctly.
But, we can do the same thing with similar text using the method strptime.
If you’re like me and are used to using ANSI SQL-type statements, I’m curious if this works.
That’s a good sign, having a CAST that works with date and datetimes can really make life easy.
Adding and subtracting days
Another one I’m curious about, because it can be a pain if it’s not supported, but comes up often, is the simple addition and subtraction of days.
Many times we find ourselves with a date and we want to go back 30 days, or forward 30 for that matter. Let’s see what DuckDB has to offer us in this respect.
Well, that is kinda funny and refreshing. Nothing easier than adding a few days apparently. Simply using the addition + operator to add some days makes it no no-brainer.
Never fear, DuckDB has a normal method for this also, if you want to be explicit, adding days, months, or whatever … using the date_add() method.
Nothing worthy of note there, which is a good thing.
Date parts
Again, in an effort to do the mundane, another piece of date magic we all do on a regular basis is pulling date parts out, say a year, month, or day of the month etc.
Easy stuff, good stuff, DuckDB making it easy.
Random date stuff
Wonderfully, there is much more date and datetime stuff in DuckDB, almost too much to go through at one time. But, here are some useful methods that will probably come in handy for normal Data Engineering work with DuckDB.
We can get a date_diff() for two dates, either by day, month, etc, whatever.
Get the day of the week.
The very important … give me the last day of the month from said date.
DuckDB and dates
Well, that was excitedly unexciting after all, which is good. I was hoping for no surprises and found none. I’m glad DuckDB has wide and familiar support for general date and datetime manipulation that are fairly common across Data Engineering.
I think it’s a great show of maturity that DuckDB can do all these simple things like adding dates, pulling date parts, converting text to date and timestamps etc.
It’s a clear sign of immaturity when they cannot support base date manipulations like we saw today. It’s the small things that matter when it comes to Data Engineering tools that might be used in a Production setting. Sure, it’s great if you can read a CSV file from s3 … but the real question is … what can do with it once you have it??