It takes me a lot of sweat, blood, and tears to bear the brunt and ire of the internet overlords to bring you raw and unfettered Data Engineering content. It would mean a lot if you would consider becoming a paid subscriber and help me keep doing the Lord’s work.
I feel like I’m sitting on the edge of the ocean, not a nice wonderful sandy beach where the gentle waves are lapping at my feet and the breeze is blowing through my non-existent hair, but rather a boiling and roiling ocean pounding the shoreline and destroying my beautiful castle of sand (ideal Data Engineering world).
This is my parable. Heed my words all you heathens.
The ocean is the Data World. The pounding waves are the insolent peddlers of destruction embracing a Notebook-driven world. They are zealots bent on the destruction of all I hold dear … fundamental truths held since the beginning of Data Time.
These rapscallions are intent on casting down laws of the universe we’ve held in common for lo, these many years.
proper development lifecycles and practices.
doing whatever you want, whenever you want, with impunity.
teaching others to do the same
We must band together, the faithful few, and fight like Spartacus against the pagan horde of Notebook Engineers arrayed against us.
The atrophy of Data Engineering … led by Notebook Engineers.
Ok, let’s cut right to the heart of the matter, if you don’t agree with me (meaning you abuse Notebooks and encourage others to follow you in your wickedness), hear me out, and let me set you straight. Repent in dust and ashes.
One thing is clear.
99% of Engineers and Data Folk who regularly use Notebooks as part of their development and production lifecycles … abuse, overuse, and do so at their own peril and the peril of their Data Platforms at large … and suffer the grave consequences as such.
The only Data Engineers who laud the destructive use of Notebooks are those that are either …
skill issue - it’s all they can technically do
rabid snake oil salespersons of giant Corporations
In fact, to do an experiment of sorts … BEFORE this article was published, I posted a preview of the title on my social media … knowing full well I could bait the pundits to prove my point.
And lo, to my great non-surprise … The Great Dragnet caught something … a well-known Databricks-sponsored pundit saying they love notebooks.
The rise of the Notebook Engineer has been orchestrated and sponsored not by The Average Engineer … nay … it’s been peddled by the modern Data Stack creators hell-bent on hooking their unsuspecting victims on practices that waste time and money, foster bad engineering, and care little for widely accepted industry best practices.
What, you say? You think I wax strong and poetic? How dare you.
This has been happening for years. Case in point.
Go read the comments in the above Reddit conversation, then decide for yourself. This is only scratching the surface of the woes visited upon us by those sinners.
People use Notebooks because they are easy, have a low barrier to entry, and are the shortcut to getting things done. They can do what they want, when they want, getting the dopamine hit right off the bat without putting in the proper work.
Notebooks are like drugs … use sparingly at best.
Heck, I have no illusions that I, piddly and small me, can fight against these behemoths and Titans of Industry. I don’t have Acme Corporation and their marketing machine at my back … I’m just a lone prophet standing in the breach calling the faithful to fight back.
Anyone who has a reasonable amount of experience in Data knows some things …
Data Engineering and Platforms have classically lacked and lagged behind acceptable SWE best practices.
Data Teams are largely made up of non-SWE background folks (I count myself in this number).
Vendors push tools and processes that at one point were meant to help, but over time they, and their followers, become addicted to their own hurt.
This is what I’m against. I’m not against Notebooks in themselves as a critical piece of the overall Development lifecycle. They are critical for Data Analysts and Scientists.
They have a purpose and they serve it.
But, they are absolutely NOT critical for a good Data Engineering lifecycle.
The problem is when, say, a vendor (cough cough) publishes every single tutorial they have using Notebooks, markets them, and literally TELLS people to use them in Production.
They will spout nonsense like the ability to put Notebooks in Git repos etc … no crap … I can put a picture of Mom in a Git repo if I feel like it … SO WHAT??
I want you to re-read that comment from Reddit above at least twice. Just let it sink into your hard heart.
The truth is that someone who uses Notebooks as their main Development tool, in the vast majority of cases (not in every case you ding dong), is doing so for very specific reasons.
Unfamiliar with local and IDE development (classic SWE) lifecycles.
Do not test code at all, and have no plans to test code.
Don’t have a reliable way to deploy code to production (aka poor CI/CD, they have no choice but to use Notebooks).
Simply skill issues
A workplace culture of get-it-done attitude only.
Simply rely 100% on their vendor for all things like a little birdy with their mouth wide open.
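To make the testing point concrete, here is a minimal sketch of what it can look like when transformation logic lives in a plain Python module instead of a Notebook cell. Everything in it is hypothetical and invented for illustration: the `Order` record, the `total_completed_cents` function, and the sample data are assumptions, not code from any real pipeline or vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    # Hypothetical record type, made up for this example.
    order_id: str
    amount_cents: int
    status: str


def total_completed_cents(orders: list[Order]) -> int:
    """Sum the amounts of completed orders, ignoring every other status."""
    return sum(o.amount_cents for o in orders if o.status == "completed")


def test_total_completed_cents() -> None:
    # A plain assert-based test; a runner such as pytest would collect it by name.
    orders = [
        Order("a", 1000, "completed"),
        Order("b", 250, "cancelled"),
        Order("c", 500, "completed"),
    ]
    assert total_completed_cents(orders) == 1500
    assert total_completed_cents([]) == 0


if __name__ == "__main__":
    test_total_completed_cents()
    print("ok")
```

Because the logic is an importable function with no hidden cell state, the same code can be exercised locally, in CI, and in production without a Notebook anywhere in the loop.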
Well, what do we do? I guess folk will have to learn their lessons the hard way. Humans are designed to err, don’t like change, and are liable to follow the crowd like lemmings.
I adjure you to not follow them down their vain path of tears and heartache. That notebook may taste as sweet as the honeycomb in your mouth but will turn bitter in your belly. It will poison you over time unless you are already in the 5 AM workout club and have immense self-control.
For the few angry Notebook zealots out there, I assure you I know there are probably a few unicorns running around out there able to use them wisely (I am one), but sadly, this world is mostly made up of horses and donkeys.
If I, a lone ranger of sorts, can convince one single Data Engineer to embrace a classic SWE development lifecycle, and SLOWLY start to reduce their dependence on Notebooks … then they are a diamond in the rough, destined for greatness, an Engineer worth hiring who will grow in their knowledge and understanding.
Guilty as charged. When creating new models/exploring new modeling paradigms/doing iterative and in-depth experimentation/explaining complex topics I simply have not found a better alternative. They're rich, portable, and are implicitly designed to support iterative workflows while keeping data in memory which saves me, probably, a billion hours a day.
new problem -> a series of notebooks as I explore the problem -> the final notebook where I solve the problem -> save artifacts -> test artifacts -> build production code around artifacts.
Modern version control systems and coding environments are not designed for the complex and experimentative work that usually represents the impetus of a data science project. That's what notebooks are for. But of course the notebooks themselves never run in a production setting, or at least I hope to god they don't.
(This is all for data science work. For data engineering I'm not really sure why they're so popular)
For a long time at the start, I too was addicted, no reusability, became a big pain. Thanks for bringing it all together so nicely.