Have you ever had one of those things in life where you just aren’t sure what to do with it? It seems like a good idea and you have visions of grandeur in your head, daydreams of the good times to be had. But, then the reality is somewhat different than you expected.
That’s how I feel about Data Engineering with AWS Lambdas. Yes, I have a handful of AWS Lambdas I’ve written running in production, yes they are nice, but sometimes I hate them as well.
They never seem to do exactly what I want. Sometimes I want to use them for everything, forcing them into something they were never meant to be; other times I hate them and migrate them away to some underutilized Airflow worker.
Alas, such is life.
With that being said, I realize AWS Lambdas are probably somewhat under-utilized by Data Engineers, so today I want to simply talk about the technicalities of using AWS Lambdas in a Data Engineering context.
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Check out their website below.
All Hail our benevolent master … AWS Lambda.
I get the feeling that AWS Lambdas aren’t used that much in the Data Engineering context, at least in the circles I run in. Not sure why. Well, maybe I do.
In the area of Apache Airflow and Databricks/Snowflake, AWS Lambdas are probably thought of as the domain of backend engineers and other such rats. But, I think that’s just because people need to have a better imagination, that’s all.
I want to give a very high-level overview of AWS Lambdas, and some of the technicals I think everyone should know about them, as well as what it’s like to deploy and manage them. This should give us a decent baseline to then discuss where and how AWS Lambdas might be useful in a Data Engineering context.
AWS Lambdas, what you need to know.
I’m going to try to give you the 10,000-foot download on AWS Lambdas.
“Run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes.”
I mean that’s really what it boils down to in my opinion. AWS Lambdas are, or should be, small bits of code running on small bits of data, in a serverless manner.
That’s part of what makes AWS Lambdas so attractive, they are a lightweight answer to some problems, requiring minimal effort and overhead to manage. Honestly, not sure why they aren’t embraced more for Data Engineering.
What else?
Lambdas can max out at 10GB of memory in size.
Lambdas can max out at 15 minutes of runtime before timeout.
You pay for both the size and runtime of a Lambda.
The best way to deploy and manage an AWS Lambda, in my opinion, is the `Container image` option. This consists of using a “Base” Docker image provided by AWS and layering on whatever you need. AWS provides base Lambda images for a handful of languages, Python, Node.js, Java, Go, Ruby, and .NET among them.
Most of the time after you’ve built an image, you can store it in AWS ECR (Elastic Container Registry), and then reference that image when creating a Lambda.
What does a Python Lambda image look like?
Something like this.
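A minimal sketch, assuming your handler lives in `lambda_function.py` and your dependencies are listed in `requirements.txt`:

```dockerfile
# Start from the official AWS Python base image for Lambda.
FROM public.ecr.aws/lambda/python:3.12

# Install dependencies into the Lambda task root.
COPY requirements.txt .
RUN pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Copy in your handler code.
COPY lambda_function.py ${LAMBDA_TASK_ROOT}

# Tell the runtime which handler to invoke (module.function).
CMD ["lambda_function.lambda_handler"]
```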
Does it look simple? That’s the point. The AWS base images provide everything you need; you simply layer in your own code and dependencies.
How does an AWS Lambda work?
Pretty much across the board, the way to interact with an AWS Lambda is to write your code with an entry point called `lambda_handler`, which the Lambda will look for and exercise by default (although you can change this).
Three of the major concepts of an AWS Lambda are EntryPoint, Context, and Trigger.
There is a single entry point in the code called lambda_handler.
There is a “context” passed to every lambda when exercised.
You can attach “triggers” to Lambdas that exercise the lambda on some event (s3 file, schedule, whatever).
Below is an example entry point for a Python lambda with a trigger that exercises the lambda whenever a file creation event happens in s3. That event is passed into the lambda’s `event` argument (alongside the `context`), and certain information can be pulled from it and acted on.
https://github.com/danielbeach/PythonVsRustAWSLambda/blob/main/pythonLambda/python_main.py
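If you’d rather not click through, here’s a bare-bones sketch of the general shape (not the exact code from that repo, just the idea, with an illustrative print standing in for real processing):

```python
import urllib.parse


def lambda_handler(event, context):
    # The s3 trigger delivers the file details inside the `event` payload.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Do whatever processing you need on the new file here.
        print(f"New file landed: s3://{bucket}/{key}")
    return {"statusCode": 200}
```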
How do you deploy a Lambda?
If you are taking the Docker route, there is a set of steps you would take to deploy a lambda. Of course, as you will see, these steps could be automated and built into CI/CD pipelines very easily.
This is not all-inclusive but gives you a general idea of a workflow.
Build the image.
Tag the image.
Push the image.
Exercise the lambda.
Something to that effect would do the job and can easily be automated.
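In practice, those steps might look roughly like this on the command line (the account ID, region, repo, and function names here are made up for illustration; this assumes the ECR repo and the Lambda itself already exist):

```bash
ACCOUNT=123456789012
REGION=us-east-1
REPO=my-data-lambda

# Log Docker into ECR.
aws ecr get-login-password --region $REGION | \
  docker login --username AWS --password-stdin $ACCOUNT.dkr.ecr.$REGION.amazonaws.com

# Build, tag, and push the image.
docker build -t $REPO .
docker tag $REPO:latest $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest
docker push $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest

# Point the Lambda at the new image, then exercise it.
aws lambda update-function-code --function-name my-data-lambda \
  --image-uri $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest
aws lambda invoke --function-name my-data-lambda response.json
```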
Also, lest ye be afraid of Docker: for something like Rust, which has no officially provided AWS base image, it’s still possible to build Lambdas straight from compiled binaries, no Docker needed! See the links below for an example of building a Rust lambda.
https://github.com/danielbeach/PythonVsRustAWSLambda
https://www.confessionsofadataguy.com/aws-lambdas-python-vs-rust-performance-and-cost-savings/
Lambdas for Data Engineering.
Ok, so maybe you know a little something more about AWS Lambdas than you did before. Maybe not. Either way, let’s talk about the intersection of AWS Lambdas and Data Engineering. The challenges, the realities, and real-life use cases.
Real-Life.
I think there are a few reasons that AWS Lambdas don’t really play a large part in many Data Engineering teams.
They require small datasets.
They have short runtimes.
They don’t always fit easily into existing architecture.
Logging and monitoring take extra work.
Teams are uncomfortable with pure-code solutions.
Teams simply reach for other tools, like Airflow.
Well, I can say that many of these reasons are valid. Do you really want to add something else to the tech stack? Are the group’s coding skills really that good? How will we log and monitor the lambdas? We don’t have good DevOps, CI/CD, or Docker practices; can we really handle deploying lambdas?
Valid, valid. But I think there are real benefits and use cases for AWS Lambdas. I’ve written a number of them in my life and have many running in Production as I write this.
So why would a Data Engineer choose an AWS Lambda?
If you want a cheap solution that embraces serverless and reduces complexity, then Lambdas are your friend.
Cheap.
Easy to use.
Drives better CI/CD and DevOps practices.
Fits well into an event-driven problem space.
Do you have smaller-ish CSV files, or any files at all, stored in an AWS s3 bucket that require some processing? AWS Lambda is perfect for that.
Do you have some data quality checks you want to run on files stored in the cloud? AWS Lambda is perfect for that.
Do you have some small task that needs to run on a schedule and execute some quick and easy logic? AWS Lambda is perfect for that.
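To make that concrete, here’s a toy sketch of the data-quality flavor: a lambda triggered by an s3 file drop that pulls a small CSV and checks it for missing values (the `customer_id` column is made up purely for illustration):

```python
import csv
import io

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Pull the (small-ish) CSV into memory and parse it.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(body)))

    # A trivial quality check: no row may be missing its customer_id.
    bad_rows = [row for row in rows if not row.get("customer_id")]
    if bad_rows:
        raise ValueError(
            f"{len(bad_rows)} of {len(rows)} rows in s3://{bucket}/{key} are missing customer_id"
        )

    print(f"s3://{bucket}/{key} passed: {len(rows)} rows, all with customer_id")
```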
Trust me, you can always find use cases once you start looking. Technically most things you run on an Airflow Worker can probably fit and run on an AWS Lambda … and it’s serverless!
Basically, what it boils down to for Data Engineering teams is this: if you want to reduce architecture complexity and cost, and drive better DevOps, CI/CD, and coding skills … AWS Lambdas are your friend. They are like the little engine that could.
Of course, they will not fit every case. Nothing over 10GB of data in memory unless you are streaming it, and no runtimes longer than 15 minutes. But there are plenty of niche use cases in Data Engineering where files are being manipulated, bash scripts are being run, and little scripts are kicking off to do this or that. All those things are perfect for an AWS Lambda. Give it a try!