When discussing Data Engineering, there are few things I can think of fraught with more bitterness and vitriol than config-driven data pipelines. Or maybe I’m just jaded. It’s probably me. That’s what the first person you dated in high school said, huh? It’s me, not you. Yeah …
Lauded by the teeming and raving masses of Data Engineers for flexibility and scalability, these “config-driven” pipelines are nonetheless riddled with “hard stuff” that often remains hidden beneath the sparkly and glittering allure of easy configurations.
Nothing is ever as it seems. There is always The Dark Side. Today, I'd like to shed light on some of these challenges from a Data Engineer's perspective.
You should check out Prefect, the sponsor of the newsletter this week! Prefect is a workflow orchestration tool that gives you observability across all of your data pipelines. Deploy your Python code in minutes with Prefect Cloud.
What the devil is a config-driven data pipeline?
While we could probably argue up and down about this very point, for our purpose, we will define a config-driven data pipeline as …
“Data pipelines whose logic and arguments are controlled by configurations external to the actual code running “the data pipeline.”
I’m pretty sure you will know one when you see one. There will probably be some giant JSON or YAML file, full of strange and wonderful things, some reading and unpacking of that config, and then logic tied to those data points. Kinda hard to miss.
The Upsides of config-driven pipelines?
Before we bemoan the evils that have befallen us due to these config-driven pipelines, we should probably be fair and laud their praises like good little peasants.
To be fair, I’ve written, even very recently, some large and complex config-driven pipelines, all the while crying over my cursed keyboard at the monster taking shape beneath my wicked fingers.
Why did I create these warped and wonderful children?
Config-driven pipelines can introduce great flexibility.
A single config-driven pipeline can take the place of many other pipelines. The one for the many.
Fewer or no code changes are required to make updates or changes.
Less technical folks, like Data Scientists and Analysts, are able to interact with pipelines more.
Decoupling of concerns … code vs params.
I’m going to stop there. I could be nicer, but since I don’t like them I will stop before you grow too connected and see all the bright spots.
Example of config-driven pipeline.
Here is an example of a simple config-driven data pipeline, just to give you an idea if you’ve never run into one before. Now this is going to be simple in the extreme, but keep your mind open.
Let's consider a simple data pipeline that reads data from a source file, performs some transformation on the data, and then writes it to a destination file. The configuration file will determine:
Source file location
Destination file location
The type of transformation to be applied
Here is the JSON config file.
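Something along these lines, say — the keys and values here are made up purely for illustration:

```json
{
  "source": "data/input.csv",
  "destination": "data/output.csv",
  "transformation": "uppercase"
}
```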
And here is the simple Python script.
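A minimal sketch of what that script might look like, assuming the config above lives in a file called `config.json` and that “uppercase” / “lowercase” are the (made-up) transformation names it understands:

```python
import csv
import json

# Load the external config that drives the pipeline.
with open("config.json") as f:
    config = json.load(f)

def transform(row: dict) -> dict:
    # Which transformation runs is decided by the config, not the code.
    if config["transformation"] == "uppercase":
        return {k: v.upper() for k, v in row.items()}
    elif config["transformation"] == "lowercase":
        return {k: v.lower() for k, v in row.items()}
    return row

# Read from the configured source, transform, write to the configured destination.
with open(config["source"], newline="") as src:
    rows = [transform(row) for row in csv.DictReader(src)]

with open(config["destination"], "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```

Change the config, and the same script reads a different file, applies a different transformation, and writes somewhere else. No code change required.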
Of course, in real life, many config-driven pipelines are large JSON or YAML files with many nested configurations for larger and more complex things. But, you get the drift I think.
There are lots of BOOLEAN switches in config-driven pipelines. The logic that says “If this is something or another … do this thing or that.”
Where are config-driven pipelines useful?
In reality, I find config-driven pipelines are useful in two major spots (a quick sketch follows this list).
Filters, inputs, lists.
BOOLEAN logic switches
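Here is a tiny, purely illustrative sketch of both spots at once — a list the pipeline filters on, and a boolean switch that picks a code path. The key names are invented for the example:

```python
# Illustrative only: config values used as filters/lists, and as BOOLEAN switches.
config = {
    "include_regions": ["US", "EU"],   # a list the pipeline filters against
    "drop_nulls": True,                # a boolean switch that picks a code path
}

def process(rows: list[dict]) -> list[dict]:
    # Filter driven by a list in the config.
    rows = [r for r in rows if r.get("region") in config["include_regions"]]
    # Code path only taken when the switch is flipped on in the config.
    if config["drop_nulls"]:
        rows = [r for r in rows if all(v is not None for v in r.values())]
    return rows
```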
Okay, let’s get to the dark side.
The evils that befall config-driven pipelines.
It’s time to talk about evils and what to watch out for in config-driven data pipelines. These things are not made up; they are what I’ve experienced and seen firsthand. They are the reality.
We know that all things in life are colored with both good and bad, and it’s good to know what we are dealing with on both sides of the coin.
Without good documentation, config-driven pipelines are a mystery to behold.
Decoupling configs from code comes at a price … harder to reason about logic.
All possible states of code and acceptable configs are hard to divine.
Debugging can become more complex and difficult.
It’s probably because I’ve been burned too many times as a programmer trying to fiddle with some obtuse and obscure configs with little to no documentation.
It’s not just about knowing what the configs do, but what they control, HOW they interact with the code, and affect it in different ways.
Complexity Concealed, Not Eliminated
We’ve talked about complexity before, about how a lot of it is simply shuffling the pea underneath different cups. The question is … can you really remove the pea, or are you just moving it from one place to another?
I conjecture that config-driven data pipelines are trying to solve a problem … namely, inherent complexity. Most likely. Otherwise, there would be little use for them.
One of the main selling points of config-driven pipelines is that they abstract away a lot of the complexity inherent to data processing. While this is true in some respects, it's also a bit misleading. The complexity doesn't disappear; it's merely concealed or moved. As configs grow, maintaining and understanding them can become as daunting as managing the codebase they replaced.
Error Handling and Logging Ambiguities
Config-driven approaches often provide very basic and “normal” error handling and logging. The nuances of the business logic and the particulars of the integration points between the code and configs will require more customized error handling.
Knowing what configurations were active during the processing and logging is going to be key to unwrapping what is actually happening, since the logic many times is tightly coupled to the configs.
A config-driven approach can lead to ambiguities where the pipeline's response to errors and logs may not always be predictable or appropriate.
Let’s take for instance this example of a config-driven pipeline that is transforming CSV data. Again this is simple, but you get the point. Imagine that there are many complex and nested configurations.
These configurations cause different code paths and executions. Say a problem happens and is logged. How do you map between what the config was that initiated the pipeline, and the code that failed at some transformation?
How can this be known during stressful production debugging in a simple and succinct manner? Walk through the code below again. Does it do this?
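Here is roughly that situation, sketched with made-up config keys (a `transformations` list with `cast` and `rename` steps). Notice that the error log tells you which row blew up, but nothing about which configuration branch actually sent execution down that path:

```python
import csv
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

# The nested config decides which branches run.
with open("config.json") as f:
    config = json.load(f)

def apply_transforms(row: dict) -> dict:
    for step in config.get("transformations", []):
        if step["type"] == "cast":
            row[step["column"]] = float(row[step["column"]])
        elif step["type"] == "rename":
            row[step["to"]] = row.pop(step["from"])
    return row

with open(config["source"], newline="") as src:
    for i, row in enumerate(csv.DictReader(src)):
        try:
            apply_transforms(row)
        except Exception:
            # We know a row failed, but not which config option produced the
            # failing code path. Mapping the error back to the config is on you.
            log.exception("transform failed on row %s", i)
```

No, it doesn’t. Unless you deliberately log the active configuration alongside the failure, that mapping lives only in your head at 2 a.m.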
Steep Learning Curve
This is a topic that is often overlooked in Engineering circles, scoffed at by tenured and wizened old software Wizards who think people should just figure it out.
The reality is that new people, junior engineers, Data Scientists, and the like are going to have to be able to unwind and understand what is happening.
Although config-driven data pipelines are marketed as user-friendly, they have their own learning curve. Engineers need to understand the underlying framework's nuances, the specifics of the config schema, and the implications of each configuration option and how it interplays with the code, affecting data flows.
This can be just as time-consuming as learning a new programming language or library. Especially if the learning is happening at the same time as a production failure.
Other Options? Event-Driven Pipelines?
I’ve also been thinking about options other than config-driven pipelines … the problem is that they are solutions to complex business-driven needs, and it isn’t always easy to find another solution.
Event-Driven Pipelines are probably the newest form of technology we deal with that can handle the scale and type of complexity that our static config-driven pipelines deal with today.
Streaming tools have become much more approachable and easier to use over the last few years, making the rearchitecting of static config-driven pipelines into something more event-based a real possibility.
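Just to gesture at the shape of the difference, here is a toy sketch (no real streaming tool involved, and the event types and handlers are invented): instead of one static config fanning out into branches, each event carries the context its handler needs.

```python
# Toy event-driven shape: handlers keyed by event type, each event carries its own context.
def handle_order_created(event: dict) -> None:
    print("writing order", event["order_id"], "to the orders table")

def handle_order_cancelled(event: dict) -> None:
    print("marking order", event["order_id"], "as cancelled")

HANDLERS = {
    "order_created": handle_order_created,
    "order_cancelled": handle_order_cancelled,
}

# In real life this loop would be a consumer on a stream (Kafka, Kinesis, etc.).
events = [
    {"type": "order_created", "order_id": 1},
    {"type": "order_cancelled", "order_id": 1},
]
for event in events:
    HANDLERS[event["type"]](event)
```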
In Conclusion
I don’t know. Maybe I talked myself back into config-driven data pipelines after all. Maybe not. I’m still torn.
They are very handy in many instances; I’ve designed some true wonders of the world, with documentation and all. They are flexible, allowing various complex pipelines to run and be triggered by many non-technical folks.