I’ve been looking forward to coming back to my series on MLOps: the basics, the stuff everyone thinks is rocket science but is not. At least the Data Engineering part is not.
Last time we talked about Feature Stores and their critical role in the Machine Learning lifecycle. The next obvious step in the process is to train a model on those features; at least, that’s the typical workflow, and it’s usually what those crazy Data Scientists do next.
Rocket science, I know, we have to create an ML Model before it can be used. Anyways, it might sound simple, and it sorta is. But, then again, it sorta isn’t. There are a lot of proverbial creepy crawlies scurrying and bustling around once you look under that log.
Well, let’s get to it.
Model Training Basics
Model Training. It’s the next thing that happens after the Feature Engineering step we talked about in a previous post. Once you have a pile o’ gold, in the form of features, the next step is to do something with them.
Typically at this point, some Data Scientists will start the R&D work of training models. Notice the plural in that last sentence. Models.
This is an essential point for Data Engineers who are new to Machine Learning Ops: repetitive tasks are what we do. Think about it, we automate everything. In essence, that’s what we do with Data Pipelines every day. We get data, we transform it, we store it, rinse and repeat.
The model training “pipeline,” if you will, should meet a few basic requirements (much like any other pipeline).
Repeatable.
Model training will be done many times over, both in the beginning and after the model is in production.
Tracking.
Every input and output should be tracked meticulously.
Easy.
Training models should be as easy as changing parameters and kicking off a job via a command line or a UI (like Airflow).
What does this look like in real life?
I think the model training lifecycle of MLOps might make a bit more sense if we walk through some pseudo-code, so we can talk concretely about the important aspects of the process rather than purely in theory.
Let’s pretend we are tasked with taking a SparkML regression model into “production.” This means some Data Scientist has come up with an idea, has been doing POC work, we have features, and we are now at the point where they want to run many training iterations. They are tired of manually changing data sources, tracking models, figuring out what the performance was, etc.
Basically, they need a system to easily train models in a reproducible and trackable manner!
Let’s look at some pseudo-PySpark code for training a model. Then we will go through it step by step and talk about the important bits. Don’t worry about the code itself; what the code is doing is what matters.
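Something along these lines. The dataset URIs, feature and label column names, and hyperparameter defaults below are all placeholders, and I’ve used a plain LinearRegression just to keep the sketch short.

```python
import argparse
import json
import uuid
from datetime import datetime, timezone

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession


def main() -> None:
    # Every input to the training run is configurable from the command line.
    parser = argparse.ArgumentParser(description="Train a SparkML regression model.")
    parser.add_argument("--train-uri", required=True)
    parser.add_argument("--test-uri", required=True)
    parser.add_argument("--model-uri", required=True)
    parser.add_argument("--results-uri", required=True)
    parser.add_argument("--features", default="bedrooms,bathrooms,sqft")  # placeholder columns
    parser.add_argument("--label", default="price")  # placeholder label
    parser.add_argument("--reg-param", type=float, default=0.1)
    parser.add_argument("--elastic-net-param", type=float, default=0.0)
    parser.add_argument("--max-iter", type=int, default=100)
    args = parser.parse_args()

    spark = SparkSession.builder.appName("model_training").getOrCreate()
    run_id = str(uuid.uuid4())

    # Pull the pre-built features (the pile o' gold from the Feature Store).
    train_df = spark.read.parquet(args.train_uri)
    test_df = spark.read.parquet(args.test_uri)

    # Assemble the chosen feature columns and train a simple regression model.
    feature_cols = args.features.split(",")
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    lr = LinearRegression(
        featuresCol="features",
        labelCol=args.label,
        regParam=args.reg_param,
        elasticNetParam=args.elastic_net_param,
        maxIter=args.max_iter,
    )
    model = Pipeline(stages=[assembler, lr]).fit(train_df)

    # Evaluate against the held-out test set.
    predictions = model.transform(test_df)
    rmse = RegressionEvaluator(
        labelCol=args.label, predictionCol="prediction", metricName="rmse"
    ).evaluate(predictions)

    # Persist the model artifact to a known, versioned location.
    model_path = f"{args.model_uri}/{run_id}"
    model.write().overwrite().save(model_path)

    # Track every input and output of the run so questions can be answered later.
    run_record = {
        "run_id": run_id,
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "train_uri": args.train_uri,
        "test_uri": args.test_uri,
        "features": args.features,
        "label": args.label,
        "hyperparameters": json.dumps(
            {
                "regParam": args.reg_param,
                "elasticNetParam": args.elastic_net_param,
                "maxIter": args.max_iter,
            }
        ),
        "model_path": model_path,
        "rmse": rmse,
    }
    spark.createDataFrame([run_record]).write.mode("append").parquet(args.results_uri)


if __name__ == "__main__":
    main()
```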
The Big Picture about Model Training.
I know it’s a lot of pseudo-code, but let’s break down the big picture. Think about running this process a few times over and being able to answer questions about how it was run. What jumps out at you as important?
I’m just going to make a giant list; I’m assuming you’re intelligent engineers, since you are reading this, after all.
There are a bunch of input parameters to model training that should be configurable:
Training and test datasets and/or URIs.
Where to save a model that is produced.
Where to save model results.
Hyperparameters.
Specific features.
There are a bunch of parameters that control and track the output:
Where the model is saved.
Where to log all information for the job run.
Train and test datasets, models, results, and parameters used in the run.
Not shown: how to actually run or automate the job (with Airflow, for example; see the sketch below).
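For that last point, here is one hypothetical way to wire the training script into Airflow with a BashOperator calling spark-submit. The DAG id, paths, and parameter values are made up; the point is that every input lands on the command line where it can be changed per run.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="train_regression_model",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually or by an upstream feature pipeline
    catchup=False,
) as dag:
    train_model = BashOperator(
        task_id="train_model",
        bash_command=(
            "spark-submit train_model.py "
            "--train-uri s3://some-bucket/features/train "
            "--test-uri s3://some-bucket/features/test "
            "--model-uri s3://some-bucket/models "
            "--results-uri s3://some-bucket/model_results "
            "--reg-param 0.1 --elastic-net-param 0.0 --max-iter 100"
        ),
    )
```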
This can all vary widely based on the use case, of course, and our example is simplistic and contrived, but I assure you, the basics of Model Training are mostly what we’ve talked about.
It’s imperative that Model Training is configurable, to allow flexibility when re-training. It’s also of the utmost importance that all the meta-data related to Model Training be saved and logged, so it can be related and analyzed later.
Maybe you store all the data in Postgres, Delta Lake, or write it to a TXT file for crying out loud. Anything is better than someone later asking the question … “How was this model trained? What were the parameters, and what dataset was used to train it? What did the performance look like for the model when it was built? Where is the model stored?”
These questions are critical, and they are a big part of the reason why ML projects fail to make it to production. Product, Marketing, Engineering, everyone starts to ask questions about a model being used or developed.
Typically the workflow we’ve talked about above is haphazard or simply doesn’t exist. It might not seem like a big deal, but these small missing pieces lead to confusion, loss of trust, errors, and the general death of a project.
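To make that concrete, here is one minimal way run meta-data could be appended to a Delta table so those questions have answers you can query later. Every value and path here is a placeholder, and the record mirrors what the training sketch above logs.

```python
from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log_training_run").getOrCreate()

# A made-up record describing a single training run.
run_record = {
    "run_id": "2024-01-15-regression-v3",
    "train_uri": "s3://some-bucket/features/train",
    "test_uri": "s3://some-bucket/features/test",
    "model_path": "s3://some-bucket/models/2024-01-15-regression-v3",
    "features": "bedrooms,bathrooms,sqft",
    "hyperparameters": '{"regParam": 0.1, "maxIter": 100}',
    "rmse": 42.7,
    "trained_at": datetime.now(timezone.utc).isoformat(),
}

# Append it to a Delta table that every run writes to, so the full history
# of model training can be queried later.
(
    spark.createDataFrame([run_record])
    .write.format("delta")
    .mode("append")
    .save("s3://some-bucket/model_training_runs")
)
```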
Getting the simple stuff right.
Make things configurable.
Make training easy to run.
Track every input and output.
These are just general engineering practices that should be applied outside of MLOps as well, so it really isn’t a big deal. It’s just about knowing the Machine Learning context: which parameters and outputs are important when training an ML model.
Did I miss something? Tell me about your MLOps experiences training models. Got horror stories? Share!