Unit Testing for Data Engineers.
I know you don't want to, but if you don't I will call your grandma.
Don't make me call your grandma and tell her what you've been doing. Writing all that code with no tests, just busting through life like you got no worries. Letting tomorrow worry about itself. How dare you.
Get off the couch, put the potato chips down, and write some unit tests.
What we are going to cover.
Why don't data folk unit test more?
Why data folk should unit test.
The anatomy of testable code … aka the How.
The end.
Why are Data Engineers so bad at performing simple tasks like unit tests that can solve so many problems? The million-dollar question, that one.
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Check out their website below.
Why don't data folk unit test?
It's funny that something as foundational as unit testing, which has been accepted in the SWE world for decades, is such a struggle for data teams to adopt. Rarely does anyone argue that unit tests are for suckers. They are generally accepted as a key way to maintain software quality and squash bugs, and yet they remain unimplemented in the vast majority of data teams.
I think there are a few reasons this is true.
Too much SQL.
Focus on moving fast.
Data has historically been seen as a quasi-business function rather than engineering.
More folks from a less traditional SWE background.
“We want to get there, but just haven’t had the bandwidth yet.”
Below is the pyramid of death for unit testing.
It’s my not-so-humble opinion that these sorts of situations formed a perfect storm. They all mixed together like some witch’s brew boiling and bubbling in the corner, producing a poison of indifference and carefreeness in the teeming masses.
Too much SQL.
I know. Sorry, I'm not sorry. I'm messing with your precious SQL again. Apparently I didn't learn my lesson the last time around.
Speaking as someone who once wrote SQL for years on end, it is a problem. I'm convinced that, for the most part, data teams that are 80% SQL-based probably have very few unit tests, if any.
Send me the hate mail, I'm ready. Sure, dbt has changed a lot of things and some people are getting the memo, but most are not.
Why is this the case with SQL? Mostly because, by design or over time, the SQL queries become large, complex, and cumbersome, and end up unwieldy and not reusable. Basically, a giant blob that has swallowed everything in its bloated path.
Such things do not lend themselves to unit testing. Oh, and don’t forget history, isn’t that what they are always telling us? History matters, and historically SQL is rarely tested, and things take time to change.
Focus on moving fast and business functions.
Another thorn in the flesh of many data teams is the fact that they are driven to rush by hard taskmasters. It’s hard not to notice the difference: data teams always sit close to the business, which makes sense, since the business claims to be “data-driven.”
This can have some unfortunate side effects.
Close to the business means high expectations and moving quickly.
Moving fast means something gets left behind (testing).
No surprises here. The truth can hurt sometimes, but honesty is refreshing. It’s better to understand why something is the way it is, so that we can see clearly how to deal with it.
If there is one piece of advice I would give data teams, it’s to slow down, take your time, and write tests. This sort of approach may anger your overlords who want stuff right now, but they don’t care about the bugs and breakages that will keep you up in the middle of the night a month from now. Push back. Slow down. Write tests.
The anatomy of testable code … aka the How.
Honestly, how to write code that is unit testable on a data team isn’t really earth-shattering, but it is surprisingly hard to find.
“Mostly because there is a correlation between whether unit tests are written and how clean the code is.”
The hurdle to having testable code is …
Are you walking into a dirty codebase with massive functions that aren’t testable?
Do you have to refactor the code before you can unit test it?
Do you have the infrastructure and knowledge to allow unit testing (Docker, etc.)?
To write unit testable code as a Data Engineer, you should approach your code as follows.
Functions or methods should be as small as possible (fewest lines of code).
Functions or methods should have as few side-effects as possible.
Functions or methods should be generalized and reusable.
For example, say we are working with PySpark and need to apply a window function, a very common task to filter out data and get the latest record.
Is this function unit testable? Why yes, in fact it is. It’s simply applying some logic to a DataFrame, which we could easily mock up in a unit test.
What makes a function not very unit testable? Let’s look at a way this function could have been written. Something not uncommon in the DE world.
Now all of a sudden we have a mess, and this is probably nice compared to a lot of DE code floating around. We have a few side effects and complexities, all wrapped into a single unit of work.
reading some remote s3 bucket for data.
reading a second data source, and then joining.
along with our original logic of filtering.
Does anyone want to raise their hand and write a unit test for this one? Not me. Why? Because it’s simply messy and not broken up into logical units of work. What are we really testing? Reading the first data set? Reading the second one? Filtering? Joining? You get the point.
When starting down the long and winding path for writing code that is testable, one fraught with peril and tears … just remember a few simple steps.
Keep it simple and clean.
Break up your logic into units.
Reduce the lines of code inside a function.
Reduce the amount of logic and “things” aka side effects that happen.
These are simple and straightforward steps that anyone can implement in their next code project.
The End.
What did we talk about today? Unit testing. It’s lacking in most data teams for various reasons, most of them obvious and easily fixable. I can’t tell you how many times I’ve heard from people “we want to, but we haven’t gotten there yet.”
Usually there is a big story behind the “we haven’t gotten there yet.” There is a lot of technical debt, bad decisions, the business pushing for things. At some point you have to pay the piper, bite the bullet, look towards your future.
There is no easier way to take your data team to the next level than to simply start writing unit tests. All sorts of good things will follow.