I still remember quite clearly my first experience as a Data Engineer working in the Machine Learning space. It was the wild west all those years ago. To say I was nervous was an understatement. My innards quaked and quailed like the trees and grass when the first edge of a storm brushes across the landscape. Trying something new, especially something that seems hard on the face of it, is always daunting.
I didn't know what to expect. Like most engineers, I had dabbled in the ML world on my own time, and my ideas about what Machine Learning actually was were very skewed. But I didn't know that yet.
If your understanding of Machine Learning is what you’ve read on Medium, then you don’t know ML very well.
I was a nervous little blighter, one minute convinced I could work my way through any problem, as I had in the past, and the next, convinced this would be my first spectacular failure. I would stumble, my brain would stop, and I would be ridiculed and driven from among the human race, doomed to wander the woods like Radagast the Brown, eating mushrooms and ruminating on my failures.
But, like most imagined nightmares, the reality was much different. So let me spin a tale of Machine Learning for you, the truth of the matter. The one no one seems to talk about.
Being a Data Engineer and working on Machine Learning problems is way easier than you think.
The truth of the matter.
First, as with most things in life, there is always a sliding scale of difficulty. The same goes for Machine Learning. But, the difficulty isn't what you think.
The above picture is my take on where the “difficulty level” of Machine Learning falls in the strata of Data Engineering work. It isn’t the most difficult work; working on Distributed Systems and Big Data is by far more challenging. Sure, ML work can sometimes be harder than the general ETL and pipeline work of day-to-day Engineers.
But in the end, ML work falls somewhere in the middle of it all. Maybe slightly to the right of the center.
What Machine Learning actually is.
Sure, there are some Machine Learning Engineers who are expected to be both Data Scientist and Data Engineer savants. But this is the exception, not the rule.
Most Data Engineers working on ML systems are just normal Data Engineers; the only difference is that they understand the lifecycle of ML and have specific MLOps experience, which anyone can pick up.
I'm sure you've heard that 90% of ML is data work. While that is true, it glosses over some very important MLOps work that Data Engineers take on to enable ML to happen at scale and in production.
This is what most ML work done by Data Engineers consists of. No magic here.
Ok, so this figure might not encompass absolutely everything, but it does cover 90%-plus of what Data Engineers working on normal ML systems at normal companies are going to do.
What you see above really encompasses the most important parts of MLOps, with of course some detail about each one. Here’s the truth: I’ve worked on putting ML pipelines into production for years now …
I can count on one hand the number of times I’ve had to hyperparameter-tune a model or do other “difficult” tasks.
Those are left to the Data Scientists. Sure, I learn at a surface level what they are doing and why, but the rest of the time is spent “hardening” and automating what some scientist is doing by hand.
So, you want to know what kind of ML model to pick for which use case, and want to learn about overfitting, parameter tuning, and other such things? Then be a Data Scientist, or if you’re a unicorn, be an ML Engineer.
Otherwise, just be a Data Engineer and learn about ML operations. Let me give a quick overview.
Features
Features. They are like the golden nugget of Machine Learning that no one ever talks about. I have always found it strange that the peddlers of Machine Learning greatness on social media rarely, if ever, even mention the word features or feature store. I mean what good is a model without features to feed to it? How can you train a model without features?
“In the context of machine learning, features are individual measurable properties or characteristics of a phenomenon being observed. They are also known as predictors, input variables, or attributes.
Features are used to represent the data in a form that the algorithms can understand and process. They form the "input" part of a machine learning system, while the "output" is the prediction or decision made by the model.”
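To make that definition concrete, here’s a tiny, made-up example of features in code. The dataset and column names are purely illustrative.

```python
import pandas as pd

# A tiny, made-up churn dataset. Every column except "churned" is a
# feature: a measurable property of the thing being observed (a customer).
df = pd.DataFrame({
    "tenure_months":   [3, 28, 14],
    "monthly_spend":   [19.99, 54.50, 32.00],
    "support_tickets": [4, 0, 1],
    "churned":         [1, 0, 0],   # the "output" the model learns to predict
})

X = df.drop(columns=["churned"])   # the features: the model's input
y = df["churned"]                  # the label: the model's output
```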
In other words, feature stores are the beating heart of most ML systems and pipelines. There are two options for feature stores: build your own or use someone else’s. Some options are …
Feast (open-source)
Or you could do as I’ve done in the past and simply build your own on top of something like Postgres. Because what is a feature store after all?
Simply a single-stop shop for ML features that are ready for production. As long as you catalog how the features were produced, what data produced them, when they were produced, and any other relevant meta information … you, my friend, have yourself a feature store.
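To show just how un-magical that can be, here’s a rough sketch of what a homegrown feature store table on Postgres might look like. Every table and column name here is made up; it’s the cataloging that matters.

```python
import psycopg2

# Illustrative schema only -- all names are hypothetical.
DDL = """
CREATE TABLE IF NOT EXISTS feature_store (
    feature_name   TEXT        NOT NULL,  -- e.g. 'avg_order_value_30d'
    entity_id      TEXT        NOT NULL,  -- the customer, device, etc.
    feature_value  DOUBLE PRECISION,
    produced_at    TIMESTAMPTZ NOT NULL,  -- when the feature was computed
    source_data    TEXT,                  -- what data produced it
    pipeline_run   TEXT,                  -- which job/run produced it
    PRIMARY KEY (feature_name, entity_id, produced_at)
);
"""

conn = psycopg2.connect("dbname=ml user=ml")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute(DDL)
```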
How would a feature store be used in real life?
In the world of Data Science, especially with multiple folks working on many models, a feature store enables you to say “Here are reliable and ready-to-use features.”
Also, feature stores can help support fast model building and training by allowing some Scientists to say “Give me x, y, and z features produced between this and that date.”
This is the basis and the underlying idea behind feature stores: they provide easy, quick, and centralized access to a vast time series of features to support quick mix-and-match model training, as well as production workload predictions.
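And that “give me x, y, and z features between these dates” request? Against the hypothetical Postgres table sketched above, it could be as simple as this.

```python
import pandas as pd
import psycopg2

def get_features(conn, feature_names, start, end):
    """'Give me these features, produced between these dates.'"""
    query = """
        SELECT entity_id, feature_name, feature_value, produced_at
        FROM feature_store
        WHERE feature_name = ANY(%s)
          AND produced_at BETWEEN %s AND %s
    """
    return pd.read_sql(query, conn, params=(feature_names, start, end))

conn = psycopg2.connect("dbname=ml user=ml")  # hypothetical connection string
training_df = get_features(
    conn, ["tenure_months", "monthly_spend"], "2023-01-01", "2023-03-31"
)
```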
Model Prediction and Training.
Alas, we could pontificate upon features ad infinitum, but at some point, we must move on. The place to which we now journey is another very basic, seemingly simple, but often ignored topic: the training and exercise of models for prediction, as well as storage for those predictions.
I know it sounds boring, but honestly, if you can’t train a model, predict with it, or do analysis on what was predicted … you’re in for some angry and grumpy Scientists.
There is a very high probability that during the R&D period of model development, some Scientist will write a script of some sort … to train a model, and then predict with that model.
They will also start to run, re-run, and re-run those same scripts a million times over with slight changes. Maybe this or that model, or this or that feature. They will predict, analyze, and start all over again.
It’s a lot of honestly repetitive tasks. The needs are simple: one or two commands or button clicks, with some passable configs, to train and predict on models. Also, easy access to metrics from training and prediction runs, with plenty of metadata.
Sound like feature stores? Not that much different, just some engineering excellence around automation, tooling, and tracking of what has actually happened. It’s just that the application is Machine Learning.
Again, build your own or find some open-source stuff. Just depends on your use cases, the size of the company, and the number of models you work with.
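If you go the build-your-own route, the heart of that tooling can be surprisingly small. Here’s a minimal sketch of a one-command train/predict script; scikit-learn stands in for whatever the Scientists hand you, and every path and config key is hypothetical.

```python
import argparse
import json

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

def load_frame(cfg):
    # In real life this would hit the feature store; a flat file stands in here.
    return pd.read_parquet(cfg["features_path"])

def train(cfg):
    df = load_frame(cfg)
    model = LogisticRegression(**cfg.get("params", {}))
    model.fit(df[cfg["feature_cols"]], df[cfg["label_col"]])
    joblib.dump(model, cfg["model_path"])  # persist the trained model

def predict(cfg):
    df = load_frame(cfg)
    model = joblib.load(cfg["model_path"])
    df["prediction"] = model.predict(df[cfg["feature_cols"]])
    df.to_parquet(cfg["output_path"])      # store predictions for analysis

if __name__ == "__main__":
    p = argparse.ArgumentParser(description="One command to train or predict.")
    p.add_argument("command", choices=["train", "predict"])
    p.add_argument("--config", required=True, help="Path to a JSON config file")
    args = p.parse_args()
    with open(args.config) as f:
        cfg = json.load(f)
    {"train": train, "predict": predict}[args.command](cfg)
```

That’s it. The Scientist changes a config, not a script, and every run goes through the same hardened path.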
Automation and MLOps.
So this is pretty much a summary of the first two items, but it’s so important it’s worth a little side note. What it really boils down to as a Data Engineer working on ML pipelines is the automation of everything.
automated training.
automated feature production.
automated prediction.
automated R&D tasks.
automated model deployment.
automated everything.
Really, it’s just about picking the correct tooling, building your own tools, or maybe a mix of both. What is important about MLOps is that all those operations are automated and configurable, so the friction to build, test, and deploy Machine Learning models and pipelines is reduced.
You want to make it easy. Easy, and hard to mess up. This enables quick iteration and model creation … it provides real business value.
What is probably one of the most popular ways to automate some of these tasks? I’ve personally used Apache Airflow twice to build production-grade ML pipelines and Ops. Why people are surprised Airflow is used is beyond me. Airflow is used all across Data Engineering, and MLOps is not really any different in the end. Of course Airflow is used.
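For the curious, here’s a stripped-down sketch of what such an Airflow pipeline can look like. The DAG name and task bodies are placeholders; in real life each task would call your feature, training, and prediction code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task bodies -- these would call the real jobs.
def produce_features():
    ...

def train_model():
    ...

def run_predictions():
    ...

with DAG(
    dag_id="churn_model_pipeline",   # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    features = PythonOperator(task_id="produce_features", python_callable=produce_features)
    training = PythonOperator(task_id="train_model", python_callable=train_model)
    predictions = PythonOperator(task_id="run_predictions", python_callable=run_predictions)

    # Features feed training, training feeds prediction. That's the pipeline.
    features >> training >> predictions
```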
Metadata Storage.
The last piece has already been mentioned, but it bears being brought up again: all the metadata storage that comes along with MLOps.
This is a somewhat unique part of the Machine Learning lifecycle that isn’t talked about much. Probably because it just isn’t cool or all that interesting.
An astute observer would have noticed in the above discussion all the references to storing data, what happened, how it happened, and relationships between data.
When a model was trained.
What data a model was trained with.
When and how features were produced.
Which model produced which predictions.
What features were used to make a prediction.
Metrics around training.
The list goes on. Metadata about what has happened, what is happening, and how a model was produced, its predictions, etc., is at the core of being able to run a reasonably stable ML environment.
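To show how unglamorous this is, here’s a sketch of a single training-run record. SQLite stands in for whatever store you’d actually use, and every value is made up for illustration.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Illustrative only: one table that answers "what was trained, with what,
# and how did it do?" SQLite stands in for your real metadata store.
conn = sqlite3.connect("ml_metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS training_runs (
        run_id       TEXT PRIMARY KEY,
        model_name   TEXT,
        trained_at   TEXT,   -- when the model was trained
        features     TEXT,   -- JSON list: what features were used
        data_window  TEXT,   -- what data the model was trained on
        metrics      TEXT    -- JSON blob: metrics around training
    )
""")

conn.execute(
    "INSERT INTO training_runs VALUES (?, ?, ?, ?, ?, ?)",
    (
        "run-2023-04-01-001",                           # hypothetical run id
        "churn_model",
        datetime.now(timezone.utc).isoformat(),
        json.dumps(["tenure_months", "monthly_spend"]),
        "2023-01-01/2023-03-31",
        json.dumps({"auc": 0.84, "accuracy": 0.79}),    # made-up numbers
    ),
)
conn.commit()
```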
What most people won’t tell you.
That’s what most people working in ML won’t tell you. Sure, there are complicated parts, just like with any Engineering task. But why don’t you ever read about these basic tenets of good ML on Medium or LinkedIn?
Probably because they are mostly boring and fairly straightforward. No magic is to be found. Just simple automation, best practices, tracking, and storage. Things that Data Engineers do on a daily basis anyway, just applied to a different class of problems.
Hopefully, if you’re new to ML, I’ve given you a taste of the real world, and a little shove toward something you might have thought was out of your reach. Because it’s not.