I'm sure at least one of those times you were doom scrolling Reddit or LinkedIn when you should have been learning, you ran across some vague argument about immutability. Am I right? I'm right. At least about the doom-scrolling part.
I would wager you might even see the topic come up more in the data engineering community sometime in the near future. Many times this topic of immutability comes up when some Scala, Java, or Rust savant is bashing a poor old Python bloke about being a little weasel, a dreg from the bottom of the proverbial barrel.
I’m guessing the majority of my readers may well be Pythonistas, happily churning out 3x the code of those Scala and Rustaceans, but I dare see we can learn a little something from them clever folk.
What shall we cover on this wonderful day?
Overview of immutability.
Benefits of immutability for Data Engineering.
Examples.
Closing Thoughts.
Immutability. Friend or foe?
In case anyone of you fine readers is from a county with only a single working stoplight, like yours truly, let's talk about what exactly is immutability.
“Something that doesn't change.” - me.
You would never know I'm from a small town, would you? Maybe the collective wisdom of the internet over on Wikipedia can shed some light for us.
“In object-oriented and functional programming, an immutable object (unchangeable object) is an object whose state cannot be modified after it is created.” - Wiki
So what do we Data Engineers think? Is immutability a friend or foe?
I would say immutability is our friend, and I hope to convince you of this. At the very least I think the underlying concepts and ideas of why to implement immutability in software can serve you well, even as you write your precious Python.
When you are approaching a piece of software, for us data pipelines, it doesn’t matter if you’re using OOP, more functional style, or something in between, you have to make choice, even if your choice is just a default because that’s all you know.
Every variable, object, and the results that you play with, must fall into one of two categories.
Immutable.
Mutable.
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Check out their website below.
If you’ve not thought about immutability that much, it may seem like such a small difference. But, I assure you, it has important implications for you how you reason and design your software patterns, data pipelines, or otherwise.
Let’s dive into some more real, rubber-meets-the-road thoughts on immutability and data engineering design patterns.
Benefits of immutability for Data Engineering.
Well of course there are probably reams of internet paper written on many varied and sundry reasons for the benefits of choosing immutable objects when programming. I will relegate that to other smart folks, and I will try to stick to the simple and obvious benefits of immutability.
Here’s my list.
Objects will be more thread safe inside a program.
Easier to reason about the flow of a program.
Easier to debug and troubleshoot problems.
Easier to test the codebase.
As Data Engineers we should always strive to reduce complexity, keep bugs from reaching production, and generally put our best foot forward whenever possible.
Part of that journey is learning and growing, even say, if we are Python users, we can understand maybe how Scala and Rust use immutability, and how it benefits those codebases, and then try our best to pull those best practices and ideas into our life!
I think it might be best to look at some of these benefits using some pseudo-code examples, as it’s a little easier to paint the picture we’re looking for.
Examples.
Let’s take a look at some of the benefits from a Data Engineering perspective. First, let’s talk about thread safety. This probably is only relevant to those Engineers who are building platforms, but it’s good for all of us to understand some of these topics a little bit.
It’s kinda obvious, but immutable objects are going to be safe, in the sense, that different processes or threads on a CPU reading the same object is going to the same and with no surprises.
“The next three points we mentioned above can be lumped into a single statement for Data Engineers. Immutability makes life easier.”
Let’s look at this example. Something like this code as follows would not be uncommon to run across in a lot of data pipelines. But, this approach has its downfalls and ends up being how most mutable code is written.
It just becomes confusing, even in a simple example what is exactly happening and what’s happening during the dataflow. When most pipelines are many times more complicated than this example it can become a nightmare to debug, add functionality, or troubleshoot that codebase.
With all objects being mutable it becomes extremely important to know what’s been changed and when.
Code can’t be debugged in a modular way, you must start from the beginning to ensure an object’s state or expected value is what you think it is at some distance point.
Mistakes are much more likely to happen, mistakes that won’t nessesarily be caught by unit tests.
Writing code where objects are immutable by default leads to a different kind of experience. Trying to write code in Python that you pretend is immutable, while not the real deal, will probably make you write in a different manner. Most likely your code will be more functional, testable, and overall more trustable.
Closing Thoughts.
I hope to open up this topic of immutability and ownership again in the future, with some more in-depth examples using Rust. For Data Engineers there can be big payoff when we switch from Python’s world of “anything goes,” to a mindset where we are more reasoned and careful about mutating objects’ state and values.
We don’t always have to drink the kool-aid we are handed, but a good sip here and there can help color the way we see the world and solve problems.
If you’re new to the whole mutability discussion, check out an article or two to get a taste, here is one for Scala, and one for Rust.