Previously, we discussed the basics of MLOps, things that are core tenets of Machine Learning pipelines, but that are apparently not cool enough to be given much attention by the talking heads. But never fear, I have come to rescue you.
You've been living in George Orwell's 1984, oblivious to what's going on around you, dutifully swallowing the drivel that Machine Learning is too complicated for the mindless masses and far out of reach. The truth is far from that, especially for Data Engineers.
Today, we will be paying special homage to one of the unsung heroes of ML, that is, Feature Stores.
We will delve into:
What are feature stores?
A preview of open source and other options.
How to use them.
How to build them.
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Check out their website below.
What is a Feature Store?
Ok, so first I’m going to give you the boring textbook talk about a Feature Store in the context of Machine Learning, and once we’ve finished with that, I will give you the real-life download.
A Feature Store is a piece of Machine Learning architecture designed to manage and serve machine learning features across different projects and models.
A Feature Store serves two primary functions:
Feature Storage: It stores, organizes, and manages features — individual measurable properties or characteristics of a phenomenon being observed. Features are often stored in a database or other storage systems.
Feature Serving: It serves features in a consistent and efficient manner to machine learning models in both training and production environments.
Feature Stores act as a bridge between feature engineering (transforming raw data into a format that is compatible with machine learning algorithms) and model training, ensuring that the same feature transformation logic is applied consistently.
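That bridge is easiest to see in code. Here is a minimal sketch of the idea, with hypothetical feature names, where one function is the single source of truth for feature logic so training and serving can never drift apart:

```python
def engineer_features(raw: dict) -> dict:
    """Single source of truth for feature logic, applied identically
    at training time (batch) and serving time (online)."""
    return {
        # Bucket transaction amount into 10 coarse bins.
        "amount_bucket": min(int(raw["amount"] // 100), 9),
        # Flag weekend activity (days are 0=Mon .. 6=Sun).
        "is_weekend": int(raw["day_of_week"] in (5, 6)),
    }

# Training: build features over historical rows.
train_rows = [{"amount": 250.0, "day_of_week": 6}]
train_features = [engineer_features(r) for r in train_rows]

# Serving: the exact same function runs on a live request,
# so there is no train/serve skew for this feature.
live_features = engineer_features({"amount": 250.0, "day_of_week": 6})
assert live_features == train_features[0]
```

A Feature Store formalizes this pattern: the transformation is registered once, and both the offline (training) and online (serving) paths read from it.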
Here are some of the key benefits of using a Feature Store:
Consistency: A Feature Store helps in maintaining consistency across the training and serving environments, which is crucial for reliable model performance. It ensures that the same features are used during both the model training phase and the model serving/inference phase.
Reusability: With a Feature Store, machine learning teams can reuse features across different models, eliminating the need to recreate or recompute the same features again and again.
Efficiency: It saves time and resources by reducing redundant computations and data processing.
Discoverability: It allows teams to search and discover existing features that can be useful for their current projects.
Monitoring: It provides monitoring and management capabilities for features, such as tracking feature distributions over time and alerting on data drift.
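To make the monitoring point concrete, here is a hedged sketch (not any particular product's API) of the simplest possible drift check: alert when a feature's current mean drifts too many baseline standard deviations from the baseline mean:

```python
from statistics import mean, stdev

def drift_alert(baseline, current, threshold=3.0):
    """Flag a feature whose current mean has shifted more than
    `threshold` baseline standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(current) != mu
    shift = abs(mean(current) - mu) / sigma
    return shift > threshold

# Stable feature: no alert.
baseline = [10.0, 11.0, 9.5, 10.5, 10.2]
assert drift_alert(baseline, [10.1, 10.4, 9.9]) is False

# Shifted feature: alert fires.
assert drift_alert(baseline, [25.0, 26.0, 24.5]) is True
```

Real feature stores use fancier statistics (population stability index, KL divergence), but the shape of the check is the same.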
While Feature Stores are becoming increasingly important in machine learning operations (MLOps), they also pose new challenges, such as managing the life cycle of features and ensuring data quality.
Real-Life Machine Learning Feature Stores.
Now that we have the boring stuff out of the way, I want to give some high-level comments and learnings from my years in the ML world. Also, keep in mind that in the flow chart I showed above, half the chart is just normal Data Engineering: the ingestion and transformation of data.
I have a lot to say and limited space, so I’m just going to make a giant list of things you should know.
Use cases for ML Feature Store vary widely and depend on the context.
One of the main “benefits” of a feature store is simply that it’s a centralized location of trusted feature data.
It’s common to see feature stores built on s3, Postgres, DynamoDB, Delta Lake, and everything in between.
Trusted Feature Stores increase the ability to develop and serve models 20x versus some data mess.
Feature Stores give a window into why models are acting/predicting a certain way.
Feature Stores are often backed up by complicated and complex data pipelines.
How, why, and where a feature came from ends up being almost as important as the feature itself.
I could keep going, but you get the idea.
Preview of Feature Store Options.
This is where you will find the world of MLOps, especially Feature Stores, starts to show the ugly little cracks that have been hiding from you. Googling your options will pretty much get you nowhere, just more confused.
Depending on the use case and type of data, the actual technology selected for the Feature Store is going to vary widely. Contrary to popular belief, the MLOps world is still very much in flux; the tools are sparse, inadequate, and limited.
Feature Stores
I’m not sure what else to do besides just run through the list of options on the table, and say a little blurb about each.
Databricks Feature Stores
Feast (open-source)
Postgres
DynamoDB
s3 (cloud storage)
Have others you use? Tell us about them in the comments.
Databricks Feature Store
Probably one of the newest feature stores, and one with great promise, is the Databricks Feature Store. It is probably one of, if not the, most scalable option for a Feature Store that needs to deal with big data.
Databricks Feature Stores are built on top of Delta Lake, and therefore able to deal with TBs of feature data with ease. A Databricks Feature Store probably makes the most sense if you are using Big Data, and Machine Learning with Spark, or other distributed systems that require scale.
It would be excessively hard, although not impossible, to use something like Postgres when you’re churning out 50 million+ features a day. Besides scale, what else can you expect from Databricks Feature Stores?
Access via a UI, for exploration, etc.
Lineage in the form of models etc that are using those features (more important than you think at scale).
Tight integration with model serving and scoring.
Feast (open source).
Next, one of the other popular options that will appear in your feature store quest is Feast. It seems to be one of the only reasonable open-source options.
It’s available on GitHub and has about 4.5K stars. That makes it a serious contender; you know you’re not getting a half-baked product at that level.
What’s interesting about Feast is that they have a paid-for version, of course, just like everyone else, so it makes you wonder about the money mongers’ long-term commitment to the actual open-source project. Say what you will, but it’s always a worry.
What is there to know about Feast in a quick download? It’s a Python project.
```
pip install feast
```
You can create a feature repository, register features, and kick off a UI. It would obviously require some CI/CD and configuration for production, but I’m sure you are smart and could overcome that obstacle.
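After `feast init` scaffolds a repository, registering features is done by declaring them in a Python file inside the repo. Here is a rough sketch of what such a definition file can look like; the entity, file path, and feature names are hypothetical, and the exact API surface varies between Feast versions, so treat this as the shape of the thing rather than copy-paste-ready code:

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# The entity is the key that features are joined on.
driver = Entity(name="driver", join_keys=["driver_id"])

# Offline source holding the raw, timestamped feature data.
source = FileSource(
    path="data/driver_stats.parquet",  # hypothetical path
    timestamp_field="event_timestamp",
)

# The feature view registers the features themselves.
driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=source,
)
```

Running `feast apply` from the repo then registers these definitions, and `feast ui` brings up the exploration UI mentioned above.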
There is a wide range of supported data sources and offline stores, enough to keep anyone happy.
Also impressive are the options for deployments, AWS and Kubernetes. And the Feature Serving client is available of course in Python, with even a Golang option on the way! Impressive.
Interested in learning more? Read through their nice documentation.
DIY Feature Stores
For a lot of my career in the ML space, it’s been the DIY option. Why? Mostly because 5 years ago there just weren’t that many well-known options, and the options available were less than ideal.
Honestly, sometimes the biggest challenge is that ML pipelines and requirements vary widely from business to business and use case to use case. It’s just hard to write MLOps pieces that can be applied across all projects in real-life ML.
I’ve used Postgres, Delta Lake, and s3 as feature stores, each with its own challenges.
Designing your own feature store isn’t hard; it just requires some foresight and planning.
Postgres and Delta Lake work great for tabular features.
Delta Lake for scale.
Postgres for medium-small data.
s3 or other cloud storage for non-traditional features.
Can be combined with Postgres to extend metadata and management.
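For the DIY tabular route, the core of the design is just a well-thought-out schema. Here is a minimal sketch; in production this would live in Postgres (or Delta Lake at scale), but sqlite3 stands in so the example is self-contained, and the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Keyed by entity + feature + timestamp, so you keep point-in-time
# history instead of overwriting values (crucial for reproducible training).
conn.execute("""
    CREATE TABLE features (
        entity_id     TEXT NOT NULL,
        feature_name  TEXT NOT NULL,
        feature_value REAL,
        computed_at   TEXT NOT NULL,
        PRIMARY KEY (entity_id, feature_name, computed_at)
    )
""")

# The ingestion pipeline writes timestamped feature values ...
conn.execute(
    "INSERT INTO features VALUES (?, ?, ?, ?)",
    ("customer_42", "avg_order_value", 31.5, "2023-06-01"),
)

# ... and training/serving reads the latest value per feature.
row = conn.execute("""
    SELECT feature_value FROM features
    WHERE entity_id = ? AND feature_name = ?
    ORDER BY computed_at DESC LIMIT 1
""", ("customer_42", "avg_order_value")).fetchone()
assert row[0] == 31.5
```

The foresight part is mostly in that primary key: entity, feature, and timestamp together are what make the store auditable and reproducible later.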
Closing Thoughts.
Honestly, Feature Stores aren’t as hard or complex as you think. The very act of creating a central store for prepared features brings enormous value and efficiency.
Instead of Data Scientists spending hours and days searching, building, and depositing features into a single place themselves before even doing the hard work, pointing to a single location of prepared features is a game changer for R&D and model development.
Most Data Teams could probably get away with a special schema in Postgres, or, if the data is at scale, Delta Lake. The more serious use cases will call for something like Feast or Databricks.