Some lessons are learned hard, some not so much, but I generally think we humans are predictable creatures, mossy old stones stuck in our own little divots of comfort and ease, not easily budged.
I dare say when the older us looks back on the younger us, we can be envious of all that spitfire and vitriol that propelled us forward to where we are today. We can also shake our heads in wonder that we actually managed to land where we are today.
The lessons seem obvious now, not so much back then.
You should check out Prefect, the sponsor of the newsletter this week! Prefect is a workflow orchestration tool that gives you observability across all of your data pipelines. Deploy your Python code in minutes with Prefect Cloud.
Today we will delve into 5 of the most common Data Engineering mistakes I’ve run into over these long and languid years. It seems to be a never-ending humdrum of lackluster Data Platforms, one after the other, like some indomitable Model-T assembly line from the turn of the century.
I end up wringing my hands in wonder at the empty sky, asking why it is so difficult to find that diamond in the mud and swamp we call Lake Houses, once called Data Lakes, once called Data Warehouses. Where are those who follow the way of old?
Before I wax any more poetic, I should turn myself towards the matter at hand. The most common Data Engineering mistakes … and what to do about them.
Instead of tempting you with bits and pieces, luring you to the end of the article, open your mind wide, and I will give you your portion in full.
5 Common Data Engineering Mistakes
Not embracing simple architecture and design.
Not having a good local development environment.
Not having a good orchestration and dependency management tool.
Not testing code and pipelines before release.
Not doing something hard.
Let’s review these little evils one by one and see how they are the bane and true enemy of every Data Engineer looking to build reliable Data Platforms.
Here we go.
Not embracing simple architecture and design.
There is nothing more classic than a Data Platform that apparently has some FOMO. At least that’s what you would think when you survey the excessively large number of tools in use.
This can apply at both the macro and micro scale inside a Data Platform. Do you really need Postgres, Snowflake, Redshift, and Databricks all at once? No, you do not, my friend. This simply indicates a fractured architecture with no direction or driving force.
Do you have more than one testing tool?
Do you have multiple data stores?
Is it unclear where to run compute or store data?
Simplicity is underrated when it comes to Data Engineering. Having a simple architecture made up of a few good tools will reduce errors and bugs, and increase the speed of development and feature delivery.
One thing Data Engineers often forget is that every time you add yet “another” tool to the stack, you are adding another …
breakpoint
integration point for other tools
piece of tech debt
layer of complexity
set of decisions
The more you work in and on Data Platforms, the more you will realize that complicated is not good, and is rarely needed outside of exceptional cases.
I’ve worked at startups that have a good 50/50 split between Postgres and MySQL! I mean really! Just pick one already!
Are you a Data Engineer who wants to grow in your team, build the next new best thing, or do something awesome?? Then do the opposite of what everyone else is doing … adding new things and new features and this and that. Simplify. Combine. Reduce.
Not having a good local development environment.
The next most common mistake has probably been experienced by everyone lucky enough to be reading this article. You show up to a new job, excited and nervous, ready to do some good work; you're a good Engineer.
But, surprise surprise, when you clone the repo(s), which may or may not contain all the code running in production … you crack your knuckles, get ready to write some code, and run the tests … just kidding … you CAN'T!
Come to find out …
they want to write tests but haven't gotten there yet.
no one has had any time to set up Docker and containerize the tooling and code
everyone just installs stuff on their machine and does whatever
some people log onto a “Dev box” instance and test things there
I can think of few things more depressing, or more of a harbinger of bad times to come.
This is a common downfall of many Data Teams: the unwillingness or inability to spend the time that is needed … upfront … to ensure Development Environments are set up in a way that makes it EASY to write, develop, and debug new or old code.
You should ask yourself these questions.
Can anyone clone our repo and run our tests (if they exist)?
Does someone have to spend a week installing a plethora of tools on their laptop to even get to the point of writing code?
Do we have a Docker image or some containerization of the tooling and stack?
Do we have a standard local development process for our Engineers?
If you don't have a reasonable way for Engineers to write, develop, and test code on their local machines … why don't we all take a wild guess where most of the code gets tested and run?
Production.
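What does the alternative look like? Here is a minimal sketch of the kind of test anyone should be able to run right after cloning the repo. The function and data below are hypothetical stand-ins for your own pipeline code.

```python
# test_transforms.py
# A toy example of a test that should run on a fresh clone of the repo.
# add_revenue_column() is a hypothetical stand-in for real pipeline code.


def add_revenue_column(rows):
    """Compute revenue = price * quantity for each record."""
    return [{**row, "revenue": row["price"] * row["quantity"]} for row in rows]


def test_add_revenue_column():
    rows = [{"price": 2.5, "quantity": 4}, {"price": 10.0, "quantity": 0}]
    result = add_revenue_column(rows)
    assert result[0]["revenue"] == 10.0
    assert result[1]["revenue"] == 0.0
```

If a fresh `git clone`, a `pip install -r requirements.txt`, and a `pytest` all work on a brand-new laptop, you've already cleared a bar most Data Teams never do.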
Not having a good orchestration and dependency management tool.
These days there is not much excuse for this problem, considering all the good options out there on the market. Some people might not think a good Orchestration and Dependency management tool for all their data pipelines matters, but I assure you it does.
For example, do you have a single place that anyone on the Data Platform can navigate to and visually see ALL of the running Data Pipelines, along with the status of those processes??
This is key to onboarding team members, getting other teams and people up to speed, being able to find things, troubleshooting things, and generally being productive and having VISIBILITY into the entire Data Platform and how things are running.
The time is past for cron jobs, sorry.
Again, simplicity matters. You want discoverability and visibility into the Data Platform operations that are running, within a few clicks of a button; this should not be a difficult or confusing task, with things scattered around or requiring CLI commands.
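To make this concrete, here is a minimal sketch of what getting a pipeline into an orchestrator can look like, using Prefect since it was mentioned above. The task names and logic are hypothetical placeholders.

```python
# daily_pipeline.py
# A minimal sketch of an orchestrated pipeline using Prefect.
# The tasks and data here are hypothetical placeholders.
from prefect import flow, task


@task(retries=2)
def extract() -> list[dict]:
    # Pretend this pulls records from some upstream source.
    return [{"id": 1, "amount": 42.0}]


@task
def transform(rows: list[dict]) -> list[dict]:
    # Pretend transformation: convert dollars to cents.
    return [{**r, "amount_cents": int(r["amount"] * 100)} for r in rows]


@task
def load(rows: list[dict]) -> None:
    print(f"loading {len(rows)} rows")


@flow(log_prints=True)
def daily_pipeline() -> None:
    load(transform(extract()))


if __name__ == "__main__":
    daily_pipeline()
```

A few decorators buy you retries, logging, and a UI where anyone can see every run and its status … which is exactly the visibility we're after.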
Not testing code and pipelines before releasing.
We sort of talked about testing, locally that is, a little bit ago, but this topic deserves more commentary, and it really comes down to “Testing in Production,” which happens all too often in Data Engineering communities.
There are so many factors that play into why this happens, some very technical, some more cultural.
Lack of ability to test locally
No Development Environment that mirrors the Production
Hustle culture of “getting it done”
Lots of pipeline breakage and fires (self-fulfilling prophecy)
Based on my experience, the whole "we don't test, but if we do, it's in production" attitude comes down to culture, and it's all driven by the leaders of the Data Platform.
If said leaders only care about delivery, and live in a constant state of breakage, with zero ability to call for a slowdown and focus on root cause and the improvement of the development lifecycle, then a culture of getting it done will reign supreme.
It's hard work. Creating end-to-end integration tests is hard; unit testing, sorta, but not really. It takes some serious willpower to go from zero to a fully capable Development environment that mirrors Production, where new code and changes can be staged and tested.
Sure, unit tests can catch SOME bugs, but they will not catch them all. A Data Platform needs the ability to run end-to-end data pipelines just like it will in Production.
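Here is a rough sketch of what an end-to-end pipeline test can look like. It assumes a hypothetical run_pipeline() entry point and uses a throwaway SQLite database purely for illustration; in real life you would point it at the Dev environment that mirrors Production.

```python
# test_pipeline_e2e.py
# A rough sketch of an end-to-end pipeline test (run with pytest).
# run_pipeline() is a hypothetical stand-in for your real entry point,
# and SQLite stands in for whatever Dev data store mirrors Production.
import sqlite3


def run_pipeline(conn: sqlite3.Connection) -> None:
    """Hypothetical pipeline: create a table and load one record."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
    conn.execute("INSERT INTO orders VALUES (1, 42.0)")
    conn.commit()


def test_pipeline_end_to_end(tmp_path):
    # Run the WHOLE pipeline against an isolated, throwaway database,
    # the same way it will run in Production.
    conn = sqlite3.connect(tmp_path / "dev.db")
    run_pipeline(conn)
    count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    assert count == 1  # the data actually landed where it should
```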
Again, this is another area where Engineers can have a big impact.
Not doing something hard.
This last piece of advice is for all Data Engineers, old and young alike, new or seasoned professionals. I saved this little bit till last, but for individual contributors, this is probably the single most important thing you can do.
And make no mistake, although it's good for you as an individual, it feeds back into the bigger picture and the team you work on.
You have to do hard things … on a semi-regular basis.
Learn a new piece of technology.
Write code in a new language.
Work on a non-coding skill (leadership type).
Read a new-to-you book at least once every few months.
Build something interesting (or contribute to open-source).
It’s hard to go to the gym. Hard to eat healthy. Hard to get enough sleep. But, we all know the benefits of doing these things, they pay back in spades.
The same goes for you as a Data Engineer. To grow and become better, you need to do hard things. The real truth is you are either coasting (which is fine for rest or other reasons) or growing, and for your own good and long-term happiness … you need to do hard things.