When it comes to testing, we are all our own worst enemies. It's the human in us coming out. We know we shouldn't eat potato chips on the couch and watch reruns, but we do it anyway. You know you shouldn't merge that PR, but you do it anyway. What's the worst that could happen?
In moments of weakness, I still fall into that trap, and I've foot-gunned myself doing crap like that for over a decade, but it still happens.
People talk about TDD like it will somehow prevent bugs from making it into production. They are idiots. There is no such thing.
Bugs are to software what SQL is to r/dataengineering.
It could be a lost cause, after all. But if you look at why production gets broken, especially in the Data Platform context, it often comes down to testing.
We should ask ourselves when, why, and how untested code sneaks its way into production and blows up our Slack channels at 10 p.m.
Missing unit tests.
It's just a "small change," so I will push it.
In a time crunch, just getting &$##@ done.
No integration tests (end to end).
No development environment.
Poor dev lifecycles (the barrier to testing is too hard).
Poor engineering culture.
I'm sure we could continue forever, but what do you notice about this list of stuff?
It's not only technical; it's human-centric as well.
These two things collide and often explode like fireworks, causing pain, suffering, and general mayhem. The combination of weak testing and unhelpful human systems surrounding a data platform and team is nothing short of apocalyptic.
How to stop breaking production.
It depends on the day as to whether I'm a glass 🍷 half-full or half-empty kinda guy. I'm not under any illusions that one can simply stop all bugs from reaching production.
But, I will poke you in the eye with your Grandma's butter knife she's been using for 30 years if you tell me it's impossible to significantly reduce the chance of bugs making it into production.
I will be honest: the number one thing on my list, the thing that must change first, is one you aren't going to like. It's a buzzword. No one hates buzzwords more than I do. 🐝
Culture.
No amount of pretend testing ideals can overcome an engineering culture that has consistently prioritized getting &@#$ done at any cost over a long period of time.
Be honest, don't lie; you know what kind of culture you are sitting in at this very moment.
Changing a culture takes time and some 🏀, but it's probably one of the most important steps you and your team can take. No, it doesn't happen overnight. No, you can't tell your boss to go pound sand. ⌛
But changing a culture can start by simply …
Asking for time to test
Planning for testing time in projects and tasks
Asking others in PRs if they have tested the change
Using language, both written and verbal, to bring up the idea of testing
Saying "no" more ("no to XYZ until it's tested")
Culture will take time, months or even years, but it can be done, and it must be done. You will not stop breaking production with catchable bugs until the culture changes, no matter the tools in place.
Unit and Integration Testing
It shouldn't need to be said, but in data engineering there is, for some unknown reason, an aversion to testing in the proper sense.
It's amazing how many Data Teams I've talked to don't take this very simple first step and then complain about how unreliable things are.
It's not rocket science.
Every method should be unit-tested.
End-to-end integration tests must exist for every pipeline.
If you can't unit test your code, it means your code sucks, and it's no surprise things break. Unit testing across the board enforces clean, modular code. Automated unit tests catch lots of "stupid" errors and mistakes.
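To make the "every method gets a unit test" idea concrete, here is a minimal sketch using plain pytest-style tests. The `normalize_amount` function is a made-up example, not from any real codebase:

```python
# A hypothetical, small, pure function: exactly the kind of thing that
# should be trivially unit-testable. If it isn't, the code needs refactoring.

def normalize_amount(raw: str) -> float:
    """Parse a currency string like '$1,234.50' into a float."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return float(cleaned)

def test_normalize_amount_handles_symbols_and_commas():
    assert normalize_amount("$1,234.50") == 1234.50

def test_normalize_amount_plain_number():
    assert normalize_amount("42") == 42.0
```

Tests this small cost minutes to write, run in milliseconds in CI, and catch exactly the class of "stupid" mistakes described above.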
Integration Testing inside a Data Platform is the holy grail of stopping 90%+ of pipelines from ever breaking. Data Systems can differ from classic Software engineering in this way: the number and complexity of modules that "data flows through" is often complex and error-prone if not tested at this "higher level."
I don't care if you TDD or not; leave that argument for the engineers who are obsessed with their own genius.
The idea is to create a great dragnet that drapes over your entire codebase. It catches unsuspecting junior engineers, and the occasional overconfident senior, who stray from the path and try to push something funny.
Unit and Integration tests are this dragnet, catching all sorts of strange things from the depths of coding hell.
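A hedged sketch of what that dragnet looks like at the pipeline level: an end-to-end test that pushes a tiny, known input through every stage and asserts on the final output. The extract/transform/load steps here are invented for illustration; a real data platform would swap in its actual pipeline entry point:

```python
# Hypothetical three-stage pipeline: filter bad rows, apply a business
# rule, write to a sink. The integration test exercises all of it at once.

def extract(rows):
    # Drop rows with a missing id
    return [r for r in rows if r.get("id") is not None]

def transform(rows):
    # Apply a made-up 10% uplift to every amount
    return [{**r, "amount": round(r["amount"] * 1.1, 2)} for r in rows]

def load(rows, sink):
    sink.extend(rows)
    return len(rows)

def run_pipeline(rows, sink):
    return load(transform(extract(rows)), sink)

def test_pipeline_end_to_end():
    sink = []
    rows = [{"id": 1, "amount": 10.0}, {"id": None, "amount": 5.0}]
    loaded = run_pipeline(rows, sink)
    assert loaded == 1                                # bad row was dropped
    assert sink == [{"id": 1, "amount": 11.0}]        # uplift was applied
```

The point is not the toy logic; it's that one test proves the stages still fit together after any change, which is where data pipelines most often break.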
It's just a small change + I’m in a time crunch.
I have to be honest: this is the one that occasionally gets me to this day. It doesn't matter how many times I swear to myself, "This is the last time I will ever push a small change to production because it's so simple" … and then promptly break everything.
This is the most classic way of all for a bug to reach production.
What seems minor from your perspective might be deeply interconnected with other parts of the system. A one-line code change can trigger unintended side effects, introduce subtle regressions, or break functionality in unexpected ways. Some of the most infamous production failures in tech history stem from seemingly insignificant changes that weren’t properly tested.
Case in point:
1. The Knight Capital Group Trading Disaster (2012) – $440 Million Lost in 45 Minutes
The Small Change:
A minor code deployment mistake in Knight Capital’s high-frequency trading system introduced an error that caused the system to buy high and sell low at an insane rate.
The Consequence:
Within 45 minutes, Knight Capital lost $440 million, wiping out the company and forcing a desperate sale to a competitor. The issue stemmed from a failure to test new deployment logic properly.
The Time Crunch Fallacy
Skipping tests to save time is an illusion. A bug that slips through now will likely be far more expensive to fix later, especially if it makes it to production.
Debugging in a live environment is stressful, time-consuming, and can damage user trust. In reality, taking a few extra minutes to write or run tests now can save hours (or even days) of crisis management later.
Practical Approaches to Testing Under Pressure
If you’re truly in a time crunch, here are some pragmatic ways to balance speed and safety:
Prioritize Critical Tests – If you can’t run everything, focus on the high-risk areas (e.g., core logic, authentication, payment processing).
Automate What You Can – Invest in unit and integration tests beforehand so you can rely on quick, automated checks when you're short on time.
Leverage Feature Flags – If the change is risky, consider deploying it behind a feature flag, allowing you to enable or disable it without redeploying.
Test in Production (Safely) – Use canary deployments, observability tools, and error tracking to detect issues early if full pre-release testing isn’t possible.
Get a Second Set of Eyes – Even a five-minute peer review can catch obvious mistakes before they cause damage.
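As one concrete illustration of the feature-flag point in the list above, here is a minimal Python sketch. The flag name, environment variable, and pricing logic are all hypothetical; real systems typically use a flag service rather than raw env vars:

```python
# A risky change ships "dark" behind a flag: it can be turned on or off
# via configuration without redeploying anything.
import os

def flag_enabled(name: str) -> bool:
    """Read a flag from the environment (stand-in for a real flag service)."""
    return os.getenv(f"FLAG_{name.upper()}", "off") == "on"

def compute_total(items):
    if flag_enabled("new_pricing"):
        # New, risky path: quantity-aware pricing
        return sum(i["price"] * i.get("qty", 1) for i in items)
    # Old, battle-tested path
    return sum(i["price"] for i in items)
```

If the new path misbehaves in production, recovery is flipping `FLAG_NEW_PRICING` back to `off`, not an emergency rollback at 10 p.m.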
A Mindset Shift
Instead of viewing testing as a blocker, see it as an insurance policy. The cost of a small delay to test is almost always lower than the cost of a major failure later. The next time you hear yourself thinking, “It’s just a small change,” remember: the riskiest bugs often come from the things we assume are safe.
No Dev Environment or Poor Dev Lifecycles.
Imagine trying to fix an engine while a car is speeding down the highway. That’s what developing software without a proper development environment feels like. Worse, when testing is so painful that engineers avoid it, mistakes inevitably slip through to production.
These two problems—lack of a dev environment and poor testing lifecycles — are common causes of broken deployments, costly outages, and endless firefighting.
No Development Environment = Testing in Production (Whether You Want to or Not)
A lack of a dedicated development environment means every change is essentially a gamble. If your only option is to deploy directly to production, you’re playing with fire. Without a sandbox to test changes safely, even a simple code tweak can cause massive failures.
Some consequences of skipping a proper dev environment:
Fear-driven development – Engineers hesitate to make changes, slowing down innovation.
Hard-to-reproduce bugs – Without an isolated test environment, debugging is chaotic.
Costly production failures – Unverified changes lead to real-world customer impact.
When Testing is Too Hard, Engineers Skip It
Even with a dev environment, if your testing lifecycle is painful—requiring long setup times, excessive approvals, or complex CI/CD configurations—developers will naturally find ways around it. When the barrier to testing is too high, you get:
"YOLO" Deployments – Engineers bypass tests and push changes directly.
Slow feedback loops – If running tests takes hours, no one will run them frequently.
Stale, broken test suites – Outdated or flaky tests reduce confidence in automated testing.
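One low-tech way to attack the slow-feedback-loop problem is a tiny "smoke" suite: a handful of the most critical checks that run in seconds, so there is always a cheap check to reach for under time pressure. This zero-dependency sketch (the function and check names are made up) shows the idea; pytest users get the same effect with a custom marker and `pytest -m smoke`:

```python
# A minimal smoke suite: a short list of must-never-break checks that
# runs in well under a second, no CI round-trip required.

def transform_row(row):
    """Hypothetical core transform we never want to break."""
    return {**row, "ok": True}

SMOKE_CHECKS = [
    lambda: transform_row({"id": 1}) == {"id": 1, "ok": True},
    lambda: transform_row({}) == {"ok": True},
]

def run_smoke():
    """Return indices of failing checks; an empty list means green."""
    return [i for i, check in enumerate(SMOKE_CHECKS) if not check()]
```

The design choice here is deliberate: a fast suite engineers actually run beats a thorough suite they route around.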
I’ve seen this many times before: an incredibly complex web of code and data that takes hours to bring up before a change can be properly tested. It wastes time, and eventually everyone stops testing.
I get it.
Yeah, I’m still like you even after all that. Doing the right thing is hard when you want to move fast and build things. The balance and struggle is real. If you work in a fast-paced culture that expects you to get things done, you sometimes do what you must.
No one wants to work in the khaki-wearing corporate culture where every single code change is tag-teamed by two people. Then you have to wait for the Dev testers to do their thing, wade through a two-week PR process … etc. That’s soul-crushing.
Summary: Testing for Software Engineers—Why We Keep Breaking Production
Testing is often treated as an afterthought in software engineering, especially in fast-moving data teams. The reality is that bugs will always exist, but how often they reach production is entirely in our control.
At its core, the problem isn’t just technical—it’s human. Engineers cut corners, skip tests in a time crunch, and push “just a small change” without considering the ripple effects. The result? Late-night Slack alerts, broken data pipelines, and production meltdowns.
Why Does Untested Code Reach Production?
Missing Unit & Integration Tests – The foundation of reliable software.
"It's Just a Small Change" Thinking – Even one-liners can break everything.
No Development Environment – Testing in production is a disaster waiting to happen.
Poor Dev Lifecycles – If testing is too hard, engineers will find ways to skip it.
Toxic Engineering Culture – Prioritizing speed over quality leads to long-term pain.
How to Stop Breaking Production
Shift Left on Testing – Make testing an early and natural part of development.
Automate Where Possible – Faster tests mean engineers are more likely to run them.
Make Testing Easy – Reduce friction in CI/CD pipelines, and support local testing.
Encourage a Testing Culture – Testing isn't just a process; it's a mindset shift.
The Hard Truth: Culture is the First Thing That Needs to Change
No tool, framework, or process can overcome an engineering culture that treats testing as an afterthought. Fixing this takes time, but small steps—asking for time to test, planning for testing in project roadmaps, and holding each other accountable—can shift the balance toward quality.
At the end of the day, testing is an insurance policy against production disasters. The choice isn’t whether to test; it's whether you'd rather catch problems early or scramble to fix them in the middle of the night.