Discover more from Data Engineering Central
Demystifying the Large Language Models (LLMs)
For Data Engineers
This is one of those funny times in the tech world where you can get many folks probably starting to feel left behind, especially with AI and LLMs (Large Language Models).
As someone who’s worked in and around MLOps for about 5 years now, I can relate to the feeling. But, I can also relate to the realization when looking “behind the curtain,” the feeling of surprise that it isn’t black or dark magic. It’s just code.
Today I’m going to prove that to you. I’m going to show you that LLMs might be one the greatest advancements of our programming age, but that at the same time, they are remarkably not different than other Machine Learning flows from the Operations standpoint.
Data Engineering Central is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Check out their website below.
What are LLMs?
Ok, so if we were Machine Learning Engineers or Data Scientists who actually cared about the internal workings of our ML models maybe this wouldn’t be so easy. But, thank the good Lord, we are Data Engineers.
We don’t worry about such fuss.
We worry about the MLOps part of the story. How can we automate the building and exercising of Machine Learning workflows? This is what concerns us.
Where are here to take our Engineering best practices and knowledge and apply it to the problem at hand. In this case Large Language Models.
Being that this area of AI and ML is surrounded by such a haze of mystical fog and fairy dust, it’s time we bring it all back down to earth. We are going to pull back the curtain on The Wizard of Oz, and peer at the person pulling the strings behind all the fizz and pops happening on social media.
Demystify the LLM.
First, we are going to demystify the LLM by treating it like just another ML model.
Sure, it’s different from a lot of previous ML models we are used to, and if you’re new to ML, it can seem like the devil’s work, but I think by the end of this article you can feel like you have a better grasp at what it takes to be an Engineer working on LLMs.
First off, yes, I’m going to boil down LLM and the toolchain that goes with it A LOT. One thing you have to remember is that we have a few basic concepts going on.
Many parts of the toolchain you might hear and see about on YouTube or Linkedin are actually “ancillary” technologies to help improve the human experience of interacting with LLMs and to make things “responsive and fast.”
What are the simple basics you need to actually understand, how you are probably experiencing it as an end-user, and therefore how companies try to build a system around LLMs to make products? Let’s try to answer these questions.
Take a quick look at the figure above. What are the components?
LLM model object (yes, a physical file for the model).
Code to “exercise” the model.
You need code that understands the model and can give input to a model and get a response.
Some way to interact with those two things above.
Is this simplistic? Yes. Is this true? Yes.
Instead of getting into more theory. Let’s SHOW you an example of actually “doing” or “running” a LLM model. This will give us a better baseline to talk about challenges for MLOps and Data Engineering in an LLM context.
First, things first. A model. And some Python.
OpenLLaMa - trying it out.
When we want to start learning LLMs and how generally you can work with and access them via code, we need to actually get our hands on a model we can use.
You can read about this project here on GitHub in more detail. Many times when you start reading about LLMs or looking into them you will start seeing numbers 7B, 13B, 30B, etc. This for lack of better terms and easy understanding is the “size” of the model, the number of input parameters it was trained on.
Obviously the larger the number, probably the more accurate but more complex and probably more costly to run inference on.
Smaller number = smaller model = quicker and faster.
Most of the open-source LLM models you will hear about are based on different papers, models, and trained on different data, and fine-tuned differently. Some of them are for specialized things like ChatBots etc.
Notes for Data Engineers running LLMs for the first time.
There are a few things I want Data Engineers to note if this is your first time exercising an LLM via code yourself. It’s important to note there is no magic here, just a different set of tools.
We will discuss some of this more in detail a little later.
You will need the “model” code.
You will need a few simple pip installs.
An important part of LLM’s is the input.
An important part of LLM’s is the output.
It’s a resource-intensive task.
It’s more about user interaction and experience than most other ML applications.
Everything we discuss might slightly vary depending on the model you are running.
With that being said, how easy or hard is it to run an OpenLLaMA 3B (smallest) model and ask it what a Data Engineer is?
You will need a minimal set of Python installs.
After that, in our case, we need to git clone the HuggingFace model OpenLLaMA. You will have to ensure the code you run, is run from the directory containing this cloned code.
Once we have that code cloned locally, we can create a Python file with the below code. Put your file somewhere in the vicinity of your previously cloned git repo, and reference it in your script as shown below.
And running that Python script gives the below output.
It’s important to note that this simple question and script for our OpenLLaMA 3B model takes a few minutes to produce a response! On a healthy M1 MacBook Pro.
No magic to be found here.
Just like with most Machine Learning applications, there is no magic. It’s just knowing things specific to the ML problem you are trying to solve.
As Data Engineers many times our job is to “productionize” ML pipelines, working closely with Data Scientists. Do we really need to know the inner workings of every ML model including a LLM? No.
The inner workings of LLMs might seem like black magic to you. I get it. But the point of the above code was to show you that it’s not out of your reach.
Just like the rest of ML, it might be surrounded by a cloud and cloak of fog, but … if I can write and exercise an LLM … so can you.
I also wrapped up this code and requirements inside a Dockerfile and put it on GitHub. This gives you no excuse to play around with an LLM yourself.
What did we need to interact with a LLM? These things give you hints at what Data Engineers might work on.
An environment or system setup to handle LLMs.
Some properly installed dependencies like pytorch, transformer, and the like.
A downloaded model to actually exercise.
The ability to write code to interact with the model.
The ability to take input from a user and give it to the model.
The ability to do something with the model output.
Of course, we are just scratching the surface of a real-life production LLM toolchain. There would be more tools, caching, etc. Setting up all these systems, making them work together nicely, and the like will keep some engineers busy.
But, that code is easy, isn’t it?! Give it a try.
Ok, Real-Talk about Data Engineering LLMs.
Ok, so now that I have brought you to the mountaintop and you are ready to become an LLM and ML wizard, it’s time for me to tear you down.
It’s not that easy. Sure, we just saw that LLMs are no black magic, but there are some serious hurdles to overcome when starting to tackle LLMs for production use cases, as Data Engineers.
Training and Finetuning LLM models is another beast.
The data required to do the above is different from tabular data most Data Engineers work with.
LLMs are slow without catching and have large resource requirements.
The user experience to get input is important.
Returning the response to the user is important.
Both of the above require “Web Dev” UI work. It's not something usually in the domain of Data Engineers.
The toolchain used to enhance LLMs is large and complex.
There isn’t a ton of documentation or clear how-do-this guides, it’s a new domain.
MLOps around LLMs is quite new and unknown at this point.
On the one hand, we saw how easy it is to get ahold of an open-source LLM and run some simple inference, but we know that getting ticked by easy “hello world” examples is far from the reality of doing things correctly in production.
You have to manage models. Versions. Training data. Track metrics. Create input and output UIs or prompts. Serve models with extensive resources. Automate everything.
If only it were easy.
I hope I opened the door a little for you with LLMs with this simple example and exercise. It’s easy to think something is out of reach just because it is unfamiliar and we don’t know where to start. But, the best place to start is somewhere.
Please comment if you have some experience with LLMs from a Data Engineering perspective. We all want to hear about it!