What’s the old saying … “Nothing ever changes?” Same old same old. I guess sometimes that could be a good thing.
Anyone who’s played around in a giant Python codebase before, a literal morass of code spewed out over years into a putrefying tangle of variables that takes a wizard to divine … would agree that having something not change would be nice … for a change.
Immutability. A concept not much used in most Data Engineering circles, but one that could be the savior of us all. Take us away into the promised land o’ thou glorious immutability.
You should check out Prefect, the sponsor of the newsletter this week! Prefect is a workflow orchestration tool that gives you observability across all of your data pipelines. Deploy your Python code in minutes with Prefect Cloud.
Why immutability matters in Data Engineering.
This is a most interesting topic in Data Engineering: data being immutable. It’s safe to assume a fair number of my readers have never thought about data as being “immutable” when writing their most recent Data Pipeline.
First, we should probably try to define immutability in a Data Engineering context.
“Immutability in the context of data engineering refers to the concept that once data has been written (into memory or disk), it cannot be changed or deleted.”
Immutability applies to data at all points during the data processing lifecycle.
It isn’t limited to data just written to disk.
Immutability can mean different things to different people, each in their own context. If you’re a Rust or Scala developer, you’re probably thinking about variables that can’t be reassigned. That’s the right instinct, but a Python codebase has to approach the idea differently than those languages do.
That’s the programming and code perspective. But there is also a pure Data perspective, regardless of your language of choice. How do you treat your data in memory or at rest?
Do you touch it? Mutate it? How do you track a piece of data through your system and code?
Everything is harder in a Data Context.
Let’s say you write your data pipelines in Python. Complex data pipelines. We know we can’t really have immutable data as such in Python, not like in Rust. It’s more of a theoretical approach, a discipline in how you write your Python pipelines.
Let’s take a Polars Python approach. Ok, so we all know production pipelines are far bigger and more complex than any example, but the pattern holds.
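Here’s a minimal sketch of the kind of pipeline I mean (hypothetical file and column names, assumed purely for illustration):

```python
import polars as pl

# Everything is the same "df" -- read it, filter it, mutate it, aggregate it.
df = pl.read_csv("sales.csv")
df = df.filter(pl.col("amount") > 0)
df = df.with_columns(pl.col("order_date").str.to_date())
df = df.with_columns((pl.col("amount") * 0.9).alias("amount"))  # silently overwrites "amount"
df = df.group_by("customer_id").agg(pl.col("amount").sum())
df = df.rename({"amount": "total_amount"})
df.write_parquet("sales_by_customer.parquet")
```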
What do you notice about this code? Well, it cares nothing for even the concept of immutability. It just embraces Python and cares about nothing else besides shoving the data through the pipeline. Everything is the same dataframe (df). That’s simply a dangerous way to approach life and problems.
In a real-life use-case, things would be much more complicated with many different calculations and this and that going on. I would argue code that doesn’t think in an immutable way becomes exponentially harder to reason about, work on, and debug.
With no philosophical or real immutability, it’s hard to find what you're looking for.
In a complex codebase with no immutability, debugging is a major task. It’s hard to know the “state” of things.
Let’s be honest, saying what we mean and treating each data point as “immutable” even in Python makes things more clear. Easier to reason about, easier to debug, and the like.
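Contrast that with the same (hypothetical) pipeline rewritten so that every transformation yields a new, descriptively named frame and nothing is ever overwritten:

```python
import polars as pl

# Each step produces a new, named frame; no step clobbers an earlier one.
raw_sales = pl.read_csv("sales.csv")
valid_sales = raw_sales.filter(pl.col("amount") > 0)
typed_sales = valid_sales.with_columns(pl.col("order_date").str.to_date())
discounted_sales = typed_sales.with_columns(
    (pl.col("amount") * 0.9).alias("discounted_amount")  # new column; original kept
)
customer_totals = discounted_sales.group_by("customer_id").agg(
    pl.col("discounted_amount").sum().alias("total_amount")
)
customer_totals.write_parquet("sales_by_customer.parquet")
```

When something breaks, you can inspect `valid_sales` or `typed_sales` directly and know exactly what state the data was in at that step.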
True immutability in code.
With Python, we can’t have true immutability in code. We can dream about it, think about it, and act like we have it, and we gain real benefits when we approach our data throughout our codebase as immutable, but enforced immutability is still a pipe dream.
This is why some teams and folks choose to dive in head first. Static languages solve this problem: think Rust and Scala. With these tools, immutability can move from an idea to a reality.
Let’s take an example of some Rust code that downloads a CSV file and does some processing. Here is a snippet along those lines (a minimal sketch, assuming the reqwest crate with its blocking feature and a hypothetical two-column id,amount file):
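```rust
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Hypothetical URL -- an immutable binding; `url` can never be reassigned.
    let url = "https://example.com/sales.csv";

    // Download the whole body as one String (also an immutable binding).
    let body = reqwest::blocking::get(url)?.text()?;

    // The running total is the one thing we intend to change,
    // so it must be explicitly declared `mut`.
    let mut total: f64 = 0.0;

    // Skip the header row; assumes an "id,amount" layout.
    for line in body.lines().skip(1) {
        let fields: Vec<&str> = line.split(',').collect();
        let amount: f64 = fields[1].parse()?;
        total += amount;
    }

    println!("total sales: {total}");
    Ok(())
}
```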
In Rust we have two kinds of variable bindings we can define with the `let` keyword. By default these values are immutable: they cannot change.
If we want to mutate something, we have to be explicit … we must `let mut`. This has a major impact on how we write and reason about code in a Data Engineering context.
In Rust, unlike Python, we can’t simply make variables or “objects” to hold data, pass them around willy-nilly, do stuff to them, pass them along again, and do more things to them.
I mean, maybe we could if we tried, but it goes against the grain of how most folks write Rust. For example, some time ago I was writing some Rust code to process a graph.
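A simplified sketch of that pattern (hypothetical names, not the original code): building the graph demands an explicit `mut`, while everything read along the way stays immutable.

```rust
use std::collections::HashMap;

fn main() {
    // Hypothetical edge list: (from, to) pairs. Immutable binding.
    let edges = vec![("a", "b"), ("a", "c"), ("b", "c")];

    // The adjacency list is the ONE thing we intend to change while
    // building, so it must be declared `mut` -- the compiler insists.
    let mut graph: HashMap<&str, Vec<&str>> = HashMap::new();

    for &(from, to) in &edges {
        graph.entry(from).or_insert_with(Vec::new).push(to);
    }

    // From here on, `graph` is only read; any later function that wanted
    // to mutate it would have to say so in its signature (&mut).
    for (node, neighbors) in &graph {
        println!("{node} -> {neighbors:?}");
    }
}
```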
When dealing with DataFrames in Rust that are a mix of mutable and immutable, the code is much more verbose.
But it has the benefit of being obvious when you read it. It’s not confusing what is what; it’s hard to “mess it up.” Everything isn’t “the same df variable,” simply because it can’t be without extra work.
Code that treats all data as immutable has lots of benefits:

- simplicity and predictability
  - no need to reason about data structures changing state
- fits well into modern architecture
- concurrency and parallelism
  - data that doesn’t change can easily be thread-safe
  - cache-friendly
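On that concurrency point, here’s a toy sketch (hypothetical numbers, illustrative only): because the vector below is never mutated after it’s built, it can be shared across threads with a plain `Arc` and no locking at all.

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // Immutable data: built once, never changed afterwards.
    let amounts: Arc<Vec<f64>> = Arc::new(vec![10.0, 25.5, 99.9, 42.0]);

    let mut handles = Vec::new();
    for worker in 0..4 {
        let amounts = Arc::clone(&amounts);
        // No Mutex needed: read-only data is trivially thread-safe.
        handles.push(thread::spawn(move || {
            let total: f64 = amounts.iter().sum();
            println!("worker {worker} sees total {total}");
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
}
```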
There is also another piece of the data immutability story that has a place in Data Engineering, namely data at rest (outside our code).
Immutability in data at rest.
Another aspect of immutability in Data Engineering that is usually overlooked, or at least not thought about, is our data at rest. Think about it: most of our data is sitting somewhere, in an S3 bucket, in a Lake House, or in a database.
How do we treat this data?
I’m sure you work on data platforms where data in its various forms and states is written to various and sundry sinks. What can you confidently say about this data? I’m sure questions come up. Is it immutable? Has it changed? Can it change?
These are serious questions with real-life implications.
Data sinks that are immutable in nature can provide clear benefits, much like code that treats data as immutable:

- historical accuracy and auditability
  - you can correctly track things that aren’t in flux
- data consistency in large distributed systems
- recovery and failover plans
- integrity
  - less likely to be messed with and broken
This can be a complicated topic, intermixed with data models, but think about it. What if you ingest raw data and assign some composite primary key made up of a few columns to help identify data as it flows through downstream systems?
What happens if someone comes along and updates a customer_id or name that is part of that primary key … aka something that should be immutable in theory?
Do you really want that data to be updated in place? Probably not. Maybe you need to model it so that a new record supersedes the old one instead.
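Here’s a tiny sketch of why (hypothetical columns, illustrative only): if a surrogate key is derived from fields we assume never change, an in-place UPDATE to any of them silently orphans every downstream reference.

```python
import hashlib

def composite_key(customer_id: str, name: str, signup_date: str) -> str:
    """Derive a surrogate key from columns we assume are immutable."""
    raw = f"{customer_id}|{name}|{signup_date}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

original = composite_key("c-1001", "Jane Doe", "2021-04-01")

# Someone "fixes" the name in place upstream...
mutated = composite_key("c-1001", "Jane D. Doe", "2021-04-01")

# ...and the key no longer matches anything downstream.
assert original != mutated
```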
Data that is mostly immutable at rest is much easier to work on. Anyone who’s worked around data long enough knows that the classic UPDATE statement … something being mutated … is probably at the core of some of the biggest bugs and data warehousing fiascos ever to strike unsuspecting victims.
The complexity of cascading changes. Akkk.
Not everything fits perfectly into an immutable world; data is complex. But Data Engineers who approach designs and models with an immutable mindset first will build far more hardened and useful systems in the long run.
What have your experiences with immutable vs mutable data been like?
Do you have horror stories? Do you fall into one camp or the other? Do you even think about data as being immutable or not? Please share your thoughts below!
What is your opinion about functional programming in Data Engineering?
There was an article by Maxime Beauchemin 4-5 years ago regarding this. I think it fits well with the immutable mindset.