We live in the Age of Containers. Docker has become the standard, data has grown, and everyone and everything is in the cloud. The Modern Data Stack we’ve all been building lo these many hard years has only made the need for an agnostic and scalable container platform more real.
And this brings us to today’s topic.
We want to give Data Engineers an introduction to Kubernetes. It’s a tool everyone talks about, but not that many folks get a chance to get their hands dirty with.
Hopefully, by the end of this article, you will at least know enough to be dangerous and break something. That’s the idea anyway.
We will start from ground zero and play around with minikube, and get our feet wet so to speak. Concepts first, then the fun stuff.
You should check out Prefect, the sponsor of the newsletter this week! Prefect is a workflow orchestration tool that gives you observability across all of your data pipelines. Deploy your Python code in minutes with Prefect Cloud.
Introduction to Kubernetes
So, let’s start from scratch and pretend you’ve been living in a Mad Max apocalypse and have no idea what Kubernetes is or does. We will start with the “what” and then the “why” before going deeper into the concepts.
“Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.” - kubernetes website
How can we distill this down even more?
Kubernetes provides the ability to manage and abstract away a cluster of machines, from which various applications can be served in a highly available and scalable manner. - me
Let’s draw out this concept of Kubernetes, and then talk about the concepts we see.
While I’m sure my Kubernetes ramblings will gain the ire of many a seasoned user, I’m here just to impart an imperfect but hopefully helpful overview.
A Kubernetes cluster consists of worker machines called Nodes.
Nodes can have one or more PODs on them.
PODs are where Containers run.
And this is all virtual.
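If you happen to have kubectl pointed at a cluster already, you can poke at this hierarchy yourself. A quick sketch (the node name is a placeholder you’d fill in from your own output):

```bash
# List the worker machines (Nodes) in the cluster.
kubectl get nodes

# Describe one Node, which includes the PODs currently scheduled on it.
kubectl describe node <node-name>

# List all PODs in all namespaces, plus which Node each one runs on.
kubectl get pods --all-namespaces -o wide
```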
I’ve used Kubernetes myself over the years for several things, and describing them may help the reader who is unfamiliar with Kubernetes see where it can help in a Data Engineering context.
What’s the best way for a Data Engineer to think about how Kubernetes can help in a data context, and in general?
Deploy distributed systems like Spark inside Kubernetes (see the spark-submit sketch after this list).
Deploy data apps inside Kubernetes.
Deploy workers from tools like Prefect onto Kubernetes.
General compute for various workloads.
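To make that first bullet concrete, here is roughly what submitting a Spark job straight to a Kubernetes cluster looks like, based on the Spark-on-Kubernetes docs. The API server address and container image are placeholders you would swap in for your own:

```bash
# Submit a Spark job with Kubernetes as the cluster manager; Spark spins up
# driver and executor PODs that live for the duration of the job.
spark-submit \
  --master k8s://https://<api-server-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
```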
What is Kubernetes good for? It’s a place to run and deploy “things”.
It’s the melting pot of the data world. It can be what you want it to be. But it’s not a joke to run and learn.
Options in the real world.
So in the real data world, I’ve worked with two different types of Kubernetes deployments.
Managed Kubernetes, like GKE from Google Cloud.
Self-hosted Kubernetes clusters.
In option 1 you can spin Kubernetes clusters up and down with a little code or the click of a button. In option 2 you typically have a host of DevOps and Platform folks, a dedicated team, supporting and running Kubernetes.
Typically option 2 is for very large teams, with hundreds or thousands of engineers deploying a wide variety of applications and services. It’s a pain. More than you can imagine.
I’m in favor of hosted Kubernetes clusters like GKE or others, simply because of lower administrative overhead. Kinda like using RDS instead of installing and managing your own Postgres instance.
Ok, enough lofty talk; it’s hard to really learn more without poking our fingers at the keyboard and doing something useful. Since most of us don’t want to spend the money on, or don’t have access to, a real Kubernetes cluster, we can play pretend with something called minikube.
“minikube quickly sets up a local Kubernetes cluster on macOS, Linux, and Windows. We proudly focus on helping application developers and new Kubernetes users.” - docs
A few notes before we begin.
Before we start playing on the command line I should mention a few more specific topics and concepts about working with Kubernetes, things you will hear and read about once you start digging in.
Services - Exposing an application running on a POD to the network (other PODs). Think Postgres; there’s a small manifest sketch after this list.
Persistent Volumes - since we are working in a virtual environment, we need a way to have permanent storage.
Helm Charts - A tool for configuring and deploying complex Kubernetes applications (think multiple PODs, storage, and services all working together).
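To make the Service idea concrete, here is a minimal sketch of a Service manifest for that Postgres example, assuming a POD labeled app=postgres (all the names here are made up for illustration):

```bash
# Piping a heredoc into kubectl is a handy way to apply a one-off manifest.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres        # route traffic to PODs carrying this label
  ports:
    - port: 5432         # port the Service exposes inside the cluster
      targetPort: 5432   # port the Postgres container listens on
EOF
```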
Learning with minikube.
What do you need?
Docker installed.
minikube installed.
Together these will let us pretend we are playing with a real Kubernetes cluster; I highly recommend this setup for experimenting with and learning about Kubernetes.
To just poke around and get something deployed locally into Kubernetes, let’s mess around with Arroyo, a new Rust-based streaming SQL tool.
Let’s try to do the following …
Deploy this tool using a Helm chart into local minikube/Kubernetes.
Look at the services, PODs, and other things we will specify.
Learn what we can if we are new to this.
First step, start minikube, you hobbit.
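On most machines this is a one-liner; minikube will pick the Docker driver automatically when Docker is installed and running:

```bash
# Spin up a local single-node Kubernetes cluster.
minikube start

# Sanity check: minikube points kubectl at the new cluster for us.
kubectl get nodes
```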
Next, we save a YAML file somewhere; this is a values file for the Helm chart, and it tells Helm how to install Arroyo into the cluster.
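I won’t reproduce the whole file here. If you want a starting point, Helm can dump a chart’s default values into a file for you to edit (this assumes the arroyo repo has been added, which we do in the next step, and that the chart is named arroyo/arroyo as in the Arroyo docs):

```bash
# Dump the chart's default configuration to a local file you can tweak,
# then hand it back to helm install with -f values.yaml.
helm show values arroyo/arroyo > values.yaml
```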
Now, we can add the Arroyo repo to our Helm installation, and deploy it.
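Roughly like so; the repo URL and chart name come from the Arroyo docs at the time of writing, so double-check them against the project’s current documentation:

```bash
# Register the Arroyo chart repository with our local Helm.
helm repo add arroyo https://arroyosystems.github.io/helm-repo
helm repo update

# Install the chart into the minikube cluster, using our values file.
helm install arroyo arroyo/arroyo -f values.yaml
```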
Now I get a bunch of messages saying it apparently worked.
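If you want a second opinion beyond that wall of text, Helm will tell you the release status directly:

```bash
# STATUS should read "deployed" if the install went through.
helm list
```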
Checking our Kubernetes application install.
Now that we have done that, maybe you can learn something and solidify some of the concepts we talked about earlier. The first step is to use our command line to describe and inspect what we have running in our “Kubernetes” local cluster.
First, remember those PODs we talked about before? We should be able to see a bunch of Arroyo PODs running using kubectl commands.
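Something like this; your exact POD names and counts will differ:

```bash
# List the PODs the chart created in the current namespace.
kubectl get pods

# Dig into any single POD for its events, containers, restarts, and so on.
kubectl describe pod <pod-name>
```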
Ah, now we can see a bunch of PODs, many of them with Arroyo in their name. It’s obvious what some of them are … we can see things like Postgres and Prometheus listed.
What else could we do? How about looking at all the services that Arroyo is exposing and what ports they are using?
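Same drill, different resource type; kubectl can list the Services and then describe any one of them:

```bash
# List the Services and the cluster ports they expose.
kubectl get services

# Inspect one Service to see its selector and target ports.
kubectl describe service <service-name>
```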
Hopefully, this is starting to solidify some of the concepts we talked about earlier, if you are new to Kubernetes.
We can deploy applications onto PODs, expose them on a network to talk to each other as needed, etc. We can even look at the volume (storage) claims being used by the PODs.
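Two more kubectl commands will show you those storage claims and the volumes standing behind them:

```bash
# The Persistent Volume Claims the PODs have made against the cluster ...
kubectl get pvc

# ... and the volumes actually bound to those claims.
kubectl get pv
```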
What do you think?
I hope if you are new to Kubernetes and have never had the chance to work with it, this has been helpful to at least introduce some of the basic and core concepts.
I personally think it’s hard to grasp new tools without playing around, and minikube + helm is the perfect way to learn Kubernetes on the cheap, on your computer.
If you are a Data Engineer and want to learn more about Kubernetes, try a project like writing a deployment for Prefect, and see if you can get that running.
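If you want a head start on that project, remember that a Prefect worker is just a long-running process, so a bare-bones Kubernetes Deployment can run one. A minimal sketch, assuming your Prefect Cloud credentials (PREFECT_API_URL and PREFECT_API_KEY) live in a Secret called prefect-secrets and you have a work pool named my-pool; both names are made up for illustration:

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prefect-worker
  template:
    metadata:
      labels:
        app: prefect-worker
    spec:
      containers:
        - name: worker
          image: prefecthq/prefect:2-latest   # official Prefect image
          command: ["prefect", "worker", "start", "--pool", "my-pool"]
          envFrom:
            - secretRef:
                name: prefect-secrets   # holds PREFECT_API_URL / PREFECT_API_KEY
EOF
```

I believe Prefect also publishes its own Helm chart, which is the more typical route in production, but writing the manifest by hand will teach you more about the moving pieces.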
Kubernetes is an interesting tool that isn’t used that much in Data Engineering. It’s seen as more of a tool for platform teams and others to run different applications. Sometimes you will find folks who have deployed Spark, or maybe their orchestration workers, with Kubernetes.
How I’ve used Kubernetes
I’ve personally used Kubernetes for a few different production tasks.
Customer geospatial data pipelines for distributed processing.
Running orchestrator tools.
Deploying small ML models and pipeline components.
To me, Kubernetes is just a way to save time and make deployments easy, if it’s managed correctly and doesn’t get out of control. But, many times you can simply find managed services and other tools like EC2 instances or Fargate to solve the same problems.
Kubernetes is great at providing fault tolerance and other features if you’re a team that needs to deploy a TON of random data components.
Do I miss working on Kubernetes? No, not that much. Did I learn a lot using Kubernetes? Very much so.