I know you all saw it coming. Probably in the back of your head for the last few years while you were burning queries on Snowflake and spinning up Databricks clusters like there was no tomorrow … something in your mind was thinking … “This is too good to be true.”
The chickens have come home to roost.
“Every data team is going to be pressured in 2023 to start caring deeply about Cloud spend. The Modern Data Stack reckoning is here. No more easy money.”
Today I want to talk about a few things.
Why the Modern Data Stack costs are getting out of control.
4 Easy Steps to reduce costs.
Conclusion.
Let's make you the cost-cutting overlord, destroying all bad practices and laying waste to the complex architecture laid down by those bloated SaaS companies and “architects” hawking their wares on every street corner.
Modern Data Stack costs spiraling out of control.
It’s no surprise things are getting out of hand in pretty much every data team, great or small. I mean, it starts with the big three.
AWS
GCP
Azure
These titans of industry are like the great Kings of old, marching around and sitting in their lofty castles imposing their burdens and taxes upon us all. In the diagram above, with AWS, GCP, and Azure at the bottom of the pyramid, those behemoths gobble up vast sums of our time, energy, and resources.
There are a few reasons why everyone is talking about costs, and rightly so. Here’s my take on it.
Pretty much every data team sees the shiny new tools (Databricks, Snowflake, etc.) and thinks they aren’t keeping up without using them. This leads to overly complex and expensive architecture.
There have been a lot of add-on SaaS companies providing everything from GUI ETL and connectors/adapters to managed compute, storage, and various other solutions. This all adds to the cost.
Hiring a plethora of engineers to support all this tooling, plus the work involved in implementation and maintenance.
Migration projects to these “new” tools can drag on and cost a lot of money (dual pipelines running, extra work, etc.)
All these things together can create the perfect storm of costs that don’t seem to stop and, in fact, keep on rising.
“Excitement around new tools like Databricks and Snowflake can lead to over-engineering and using a sledgehammer when in fact you need a small hammer. It ends up at the bottom line, costing big money.”
4 Easy Steps to Reduce Cloud Costs.
Most of these four easy steps to reduce cloud costs are focused more on data teams and data engineers. On the surface, of course, they look easy, but don’t be fooled; they will take a little time and focus to walk through and resolve. But the payoff, in the end, will be worth it.
You are sure to save noticeable amounts of money if you approach each step with vigor.
“To be honest, most of these cost-saving steps don’t get done on most data teams. Why? Because they always fall into the technical debt category, that is, until everyone is concerned about spend.”
I used these 4 steps because they are attainable and doable for any data team, and they can be done “on the side,” while other important work continues.
Identify the long-running processes.
Look into your cloud storage usage.
Choose the correct tool for the job.
Deep dive into your compute usage.
See? Not that bad eh? Ok, well maybe they are a little involved.
Identify the long-running processes.
This is probably one of the easiest steps anyone can take on a data team to reduce costs, and it has the benefit of being a “fun” task for data engineers to work on. If you’re one of the teeming masses using Airflow, for example, finding long-running tasks is a trivial matter.
This sort of work, identifying the long-running data pipelines or tasks and finding out what the problem is, gives data engineers something hard and fast to work on, with a certain mental payback ... aka it’s fun and rewarding work.
The 80/20 rule applies here. 20% of your data tasks are eating 80% of the costs.
It’s easy to fix: throw engineers at those problem spots and they will find a solution.
Doing this will identify where the money is being spent, or a good chunk of it.
Optimize, optimize, optimize, that is the name of the game. Look for those problem children and send them to the principal’s office, give them the old whack with the ruler. Chances are you’re spending money, a decent chunk, on some long-running and poorly designed processes that could have runtimes cut by 3/4. Get to it.
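If Airflow is your scheduler, a quick query against the metadata database will hand you the hit list. Here’s a minimal sketch, assuming a Postgres metadata database and read-only access; the connection string is a placeholder, not a real one.

```python
# Sketch: find the slowest tasks in the Airflow metadata database (Postgres assumed).
# The connection string is a placeholder; point it at your own metadata DB, read-only.
import sqlalchemy as sa

engine = sa.create_engine("postgresql+psycopg2://airflow:***@airflow-db:5432/airflow")

query = sa.text("""
    SELECT dag_id,
           task_id,
           ROUND(AVG(duration)::numeric, 1) AS avg_seconds,
           COUNT(*)                         AS runs
    FROM task_instance
    WHERE state = 'success'
      AND start_date > NOW() - INTERVAL '30 days'
    GROUP BY dag_id, task_id
    ORDER BY avg_seconds DESC
    LIMIT 20;
""")

with engine.connect() as conn:
    for row in conn.execute(query):
        print(f"{row.dag_id}.{row.task_id}: avg {row.avg_seconds}s over {row.runs} runs")
```

Sort descending, grab the top twenty, and you have your optimization backlog for the quarter.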
Look into your cloud storage usage.
Buckets, buckets, and more buckets, tired of hearing about buckets yet? Cloud storage costs are one of those sneaky little buggers that come in the middle of the night and steal all your quarters out of the change jar.
I know, I know, everyone always says “storage is cheap,” and to a certain point it is. But things can add up once you hit a few hundred terabytes and keep growing. Luckily there is usually plenty of low-hanging fruit to be had on those storage trees, delicious fruit full of money-saving nectar to satisfy your soul.
Make sure all your data is compressed.
Find all your unused and un-accessed data, in all environments!!
Put some of your data into cold storage, after a certain amount of time.
Nothing is easier than working with cloud storage. Start by ensuring all your data is compressed … storing .CSV files in the cloud? Think about .gz’ing them; it will save you money.
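Here’s a rough sketch of what that looks like with boto3: gzip the file before it ever lands in the bucket. The file and bucket names are made up for illustration.

```python
# Sketch: gzip a CSV before shipping it to S3 instead of uploading it raw.
# Bucket and file names are placeholders; most engines (Athena, Spark, Snowflake) read .csv.gz natively.
import gzip
import shutil
import boto3

with open("events.csv", "rb") as src, gzip.open("events.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)  # stream-compress, no need to hold the whole file in memory

boto3.client("s3").upload_file("events.csv.gz", "my-data-lake", "raw/events.csv.gz")
```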
Go look into that awful development environment you have; you will be surprised how fast data piles up in there after a few years. Nothing is quite as satisfying as deleting vast amounts of data.
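If that dev environment lives in S3, a sketch like this (bucket name hypothetical) will tell you how much crud is sitting there untouched. Note that S3 only tracks last-modified, not last-read, so treat the output as a candidate list, not a delete list.

```python
# Sketch: tally objects in a (hypothetical) dev bucket that haven't been touched in two years.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=730)

stale_bytes = 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-dev-scratch"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] < cutoff:
            stale_bytes += obj["Size"]

print(f"Candidate for cleanup: {stale_bytes / 1e9:.1f} GB")
```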
The last money-saving tool, one that is not often used, is simply putting old data into cold storage: all the crud that is years old and that you hardly ever use … easy peasy lemon squeezy.
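On AWS you don’t even have to move the data yourself; a lifecycle rule does the demotion for you. A minimal sketch with boto3, with the bucket, prefix, and day counts as placeholders you would tune to your own retention needs.

```python
# Sketch: an S3 lifecycle rule that shifts anything under a prefix to Glacier after a year
# and to Deep Archive after two. Bucket, prefix, and day counts are placeholders.
import boto3

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 365, "StorageClass": "GLACIER"},
                    {"Days": 730, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```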
Choose the correct tool for the job.
What’s that saying? “Beggars can’t be choosers,” or can they be? This might be one of those times when you should be picky and choosy, like when your mom yelled at you growing up to eat all your food and stop picking at things.
“One of the most common and expensive pitfalls that data teams make is running after shiny new tools, or being enamored with a certain tool, to the exclusion of all others, like that poor horse plodding along the city parade with those blinders on, oblivious to all those screaming kids pelting it with candy.”
You need to be critical with yourself and your team about what you are using to process what data.
Don’t use Big Data tools like Spark to process Pandas-sized data.
Move processes around between tools to match requirements.
Evaluate each tool and cut where you can.
Think about it, you’re paying to spin up some Databricks cluster with Spark, so you’re being charged some AWS or GCP instance price, plus some Databricks cost on top, all for a few GBs of data that could be processed on an Airflow worker with Pandas, Datafusion, or Polars.
Just use common sense and save yourself a bunch of money.
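For a sense of scale, here’s the kind of job I’m talking about, sketched in Polars; the path and column names are hypothetical, but the point is that it runs happily on a single modest worker, no cluster bill attached.

```python
# Sketch: a "few GBs" aggregation that doesn't need a Spark cluster.
# Path and column names are made up; assumes cloud credentials are configured for the s3:// scan.
import polars as pl

daily_revenue = (
    pl.scan_parquet("s3://my-data-lake/orders/*.parquet")  # lazy scan, nothing loaded yet
    .filter(pl.col("status") == "complete")
    .group_by("order_date")
    .agg(pl.col("amount").sum().alias("revenue"))
    .sort("order_date")
    .collect()  # executes the whole plan on one worker
)

daily_revenue.write_csv("daily_revenue.csv")
```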
If you’re lucky you might even identify some tool that simply isn’t giving enough bang for your buck. When times are easy, teams default to picking stuff willy-nilly with no thought to ROI, or to whether there is a different way to solve the problem.
Don’t be scared to be creative and find a way to solve problems with a smaller subset of tools. For example, Airflow is a tried and true technology that offers a massive range of Operators and features that might be able to replace certain data tasks found elsewhere.
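As a hedged example, here’s roughly what replacing a paid connector with a plain Airflow DAG can look like: pull a file from an API and drop it in S3. The endpoint, bucket, and schedule are all made up.

```python
# Sketch: a plain Airflow 2.x DAG standing in for a paid "connector".
# URL, bucket, and schedule are hypothetical placeholders.
from datetime import datetime

import boto3
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def pull_report_to_s3():
    resp = requests.get("https://api.example.com/v1/daily-report.csv", timeout=60)
    resp.raise_for_status()
    boto3.client("s3").put_object(
        Bucket="my-data-lake",
        Key=f"reports/{datetime.utcnow():%Y-%m-%d}.csv",
        Body=resp.content,
    )


with DAG(
    dag_id="daily_report_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="pull_report", python_callable=pull_report_to_s3)
```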
Deep dive into your compute usage.
I saved the hardest one for last. Compute. If every data team were to look at their bill at the end of the month … where is most of the money going? Compute. Compute. Compute.
“Compute is the start and the end of data engineering; it’s the workhorse that does all our transformations and acts upon most of what we create as data engineers. Therefore it’s going to eat a large portion of our money.”
There is no easy way to solve the compute problem, but there are a few simple steps that anyone can take.
Identify under-utilized compute.
Identify over-utilized (bottleneck) compute.
Identify whether you are using the correct type of compute.
Take a gander at all your data transformations and workloads, and inspect what type of resources they are consuming; most tools allow you to see some sort of memory and CPU utilization metrics. Dive in and find where you are wasting compute and money.
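If your workers run on EC2, CloudWatch already has the numbers. A small sketch, with a hypothetical instance id, that pulls a week of average CPU utilization; if it never climbs above 20%, you’re paying for headroom you don’t use.

```python
# Sketch: pull a week of hourly average CPU utilization for a (hypothetical) worker instance.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=3600,            # hourly datapoints
    Statistics=["Average"],
)

for p in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(f"{p['Timestamp']:%Y-%m-%d %H:%M}  {p['Average']:.1f}% CPU")
```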
On the other end find where your compute is maxed out and causing bottlenecks. This can have some overlap with the optimizations we talked about earlier, but you are sure to find a few spots where you can dial down some resources. That’s just life.
Also, never forget to inspect the type of compute you are using; what I mean by this is the instance types you are using in the cloud. There are massive differences in how, when, and why you are using them. It’s a bit of black magic and luck, but you can reduce costs by simply switching to SPOT instances, different regions, or an instance type that has slightly different CPU and memory stats.
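Spot pricing is a good example of research that costs nothing. A quick sketch with boto3; the instance types and region are just examples, swap in whatever you actually run.

```python
# Sketch: eyeball recent spot prices for a couple of comparable instance types.
# Instance types and region are examples; spot markets shift constantly, so check your own.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_spot_price_history(
    InstanceTypes=["m5.2xlarge", "m6i.2xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
)

for item in resp["SpotPriceHistory"]:
    print(f"{item['InstanceType']} ({item['AvailabilityZone']}): ${item['SpotPrice']}/hr")
```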
So a little research, save some money.
Conclusion.
Well, there you have it. I hope you weren’t looking for some late-night TV fix-it duct tape or goop. For 9.99 you can cut all your cloud costs in half! Life doesn’t work like that, and neither does data engineering. Be wary of anyone who tells you otherwise.
But, you can start with the basics.
Identify the long-running processes.
Look into your cloud storage usage.
Choose the correct tool for the job.
Deep dive into your compute usage.
I’m confident that if you tackle a few things on this list and give it a few days, you can save a little money. Your boss will pat you on the head and maybe throw you a bone. Have other suggestions or stories of saving big money?! Please share!