The older and crustier I get, and the more my fingers hurt from pounding those keys in data frustration, one thing has become clear as I look through glassy eyes at the last few decades of writing code and squeezing data till it screams.
Automation is the name of the game, an unsung hero ignored by the teeming masses of new developers who spew onto the software landscape year after year.
Part of me thinks that it’s just not something that comes naturally to someone in the beginning. When we start our coding and data journeys, we skip along the trail of life, happy and flamboyant, full of future dreams and expectations.
We pretty much only think about ourselves and the wonderful (in our eyes) code our flexible and supple little fingers pound out.
What is CI/CD in a Data Engineering Context?
In case you live under a rock or just slid off the software engineering assembly line … what does ChatGPT say CI/CD is?
“CI/CD stands for Continuous Integration and Continuous Deployment (or Continuous Delivery), and it's a set of practices and tools in software engineering that aim to deliver code changes more frequently and reliably.”
You know, methinks this is a high and lofty goal. Honestly, most companies fall into one of two categories, in my experience.
Already the masters of CI/CD and need no help.
Little to nothing in place.
Sometimes you find people wandering around in the middle of these two, but not that often.
I would like to propose a few basic ideas for Data Engineers looking to learn about and integrate CI/CD into their companies and workflows.
In essence, can we boil the stewing cauldron of bubbling ideas down to a few actionable and straightforward points that everyone can embrace?
While I’m sure there are a number of Platform and DevOps Engineers who will spit and moan while reading my simple steps for CI/CD for Data Teams, pooh on them.
While there are many ways to make this complicated, the fact is that most Data Teams struggle to even do the simple things.
Diving into CI/CD for Data Engineering - Testing
Let’s start from the beginning, the very beginning, where God said “Let there be Testing … and there was testing.” I think when it comes to CI/CD and needing to start from somewhere … that somewhere is testing.
Why?
Because those who don’t test at all are going to have a certain culture, codebase, and architecture that probably doesn’t lend itself to CI/CD.
If you are unwilling to have tests … but think you will embrace CI/CD? Don’t kid yourself.
I can’t tell you how many people I’ve talked to and data teams I’ve come across who, when pushed and prodded, can’t answer “yes” to the following questions.
“If I cloned your data repo today, could I, with one or two commands, run all the tests locally? Also, if I made a change to said repo and pushed a commit, would tests run somewhere?”
Just take a gander around r/dataengineering and you will discover testing is a question often asked … meaning lots of people don’t do it today.
I’m not really going to spend any time trying to convince you that testing your data pipelines and transformations is something you should do. It’s kinda obvious and unbelievers are anathema to me.
Also, I can’t go into depth to teach you everything about testing, so I will link some articles below if you need more info.
Why automated testing?
What it really boils down to though is automated testing. What do I mean?
Anyone should be able to run tests locally on their machine after touching a codebase.
Any change pushed to a repository should trigger automated tests to run.
This is critical to protecting codebases and can act as a first line of defense. We are all human, and we all make mistakes from time to time.
And lest you think this is hard or some black magic, here is what a CircleCI YAML file that runs automated tests on a commit can look like.
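This is a stripped-down sketch, not a full production file; it assumes a Python project with a `requirements.txt` and a `tests/` folder, so swap in whatever your repo actually uses.

```yaml
version: 2.1

jobs:
  test:
    docker:
      - image: cimg/python:3.11  # CircleCI's Python convenience image
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: pip install -r requirements.txt
      - run:
          name: Run tests
          command: pytest tests/

workflows:
  run-tests:
    jobs:
      - test  # runs on every commit pushed to the repo
```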
And yet again, GitHub Actions can do the same thing; tinytimmy, an open-source package of mine, runs its tests this way.
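Here’s a simplified sketch of that kind of workflow (the actual file lives in the repo; this version assumes the same Python layout as above):

```yaml
name: tests

on: [push, pull_request]  # run on every push and PR

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest
```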
It can seem simple, but that is kinda the point: it should be seamless and just part of the everyday process. Testing is important for well-rounded data teams, and automated testing is a huge piece of that pie.
Diving into CI/CD for Data Engineering - Automated Deployments
This is another small but important piece of CI/CD for Data Teams, and for any team developing software. The automated deployment of whatever exists in your repository is a baseline for Data Platforms that aren’t held together with shoestring and glue and “oh crap, they’re on vacation … I don’t know how to do that.”
This is really what most CI/CD comes down to in the end.
How do you take humans out of the boring processes that can break?
Every Data Platform has its own tools and architecture, they all run their code slightly differently, and it’s rare for two companies to run the exact same stack.
If you have certain pieces of code and infrastructure that need to be deployed to certain spots in a certain way … you can’t just rely on new engineers to do this, or even experienced ones to remember everything all the time.
So, some simple scripting can deploy your code the way it needs to every single time. You know what to expect.
Instead of Engineers spending time piddling with manually pushing code or infrastructure changes here and there via random scripts or the UI, automated deployments ensure they can focus on what they need to be doing … producing quality software.
And, honestly, it would only take a few days to set up the automated deployment process in the first place, and it would save serious time and trouble downstream.
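To make that concrete, here is a sketch of what a deploy-on-merge workflow can look like in GitHub Actions. The `deploy.sh` script here is a hypothetical stand-in for whatever your platform actually needs: pushing a Docker image, syncing DAGs to S3, applying Terraform, you name it.

```yaml
name: deploy

on:
  push:
    branches: [main]  # deploy only when changes land on main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy pipeline code
        # deploy.sh is a placeholder for your platform's real deploy steps:
        # build and push an image, sync DAGs, apply infra changes, etc.
        run: ./deploy.sh
```

The point isn’t this exact file; it’s that the deploy happens the same way every single time, no human required.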
Diving into CI/CD for Data Engineering - Clean Code
One thing that most Data Teams can fall victim to as well is the atrophy of code. Everything usually starts with good intentions, but life and timelines take over and things start to fall off the wagon.
“Code atrophy in software engineering refers to the gradual degradation or obsolescence of code over time.” - ChatGPT
It can be hard to keep on top of code quality and expectations with just the PR and code review process. It probably depends on what kind of day someone is having: are they going to pay attention, or even care, from one day to the next?
Some folk might find it easy to pooh-pooh this sort of clean code and expectations of quality, but in large codebases with multiple people and teams committing changes, things can get messy and out of hand easily.
It’s hard to overstate the real cost of an unclean codebase when it comes to feature addition and debugging.
I mean, the first and simplest step would be to run a `ruff` check as part of the automated checks when someone commits code, and fail the check if it doesn’t come back clean.
Of course, there are other formatting tools like `black` for Python etc.
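The whole thing can be one tiny job in whatever CI you already run tests in. Something like this sketch, assuming the same GitHub Actions setup as above:

```yaml
name: lint

on: [push, pull_request]

jobs:
  ruff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install ruff
      # ruff exits non-zero on any violation, which fails the check
      # on the offending commit.
      - run: ruff check .
```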
Wrapping it up
As you and I journey through the never-ending labyrinth of data engineering, it's clear that automation, especially CI/CD practices, is indispensable. The path from manual to automated processes may seem daunting, but the benefits far outweigh the initial effort.
It’s what separates good Data Teams from the milquetoast cadre of developers.
Embracing testing and automated deployments not only enhances code reliability but also frees up precious time for innovation. Remember, the foundation of a robust CI/CD pipeline starts with testing.
Ensuring your code is clean and deployable without human intervention creates a resilient and efficient workflow. So, take the plunge into CI/CD, and watch your data engineering processes transform.