Redshift vs Snowflake vs BigQuery vs Databricks vs ...
Why is the Data Warehouse Battle still going?
The other day I read a post on that most sinister of websites, r/dataengineering, known for its unapologetically ruthless horde of so-called Data Engineers.
That wicked rabble of ne'er-do-wells was pontificating on the current state of the Data Warehouse, in the context of the big cadre of tooling at our disposal.
Redshift
Snowflake
BigQuery
Databricks
etc.
One thing I find slightly amusing after these years of data life is the vendor-led confusion as to what exactly is or isn’t a Data Warehouse.
It’s unfortunate we find ourselves where we do, simply because marketing and product teams have conspired against us all, but here we are nonetheless.
What do we know …
Data Warehouses came first. In the beginning they ran on SQL Server, Oracle, etc. Many people still associate the term Data Warehouse with these RDBMSs, but a Data Warehouse doesn’t have to run on one.
Data Lakes came next. That’s what people call dumping files of various sorts (CSV, Parquet, JSON, etc.) into cloud storage buckets like S3.
Then tools like Delta Lake came along, a combination of the above two: file-based systems that provide DW-like features such as ACID transactions and CRUD operations. Vendors dubbed this the Lake House.
But, you have to remember, Kimball’s The Data Warehouse Toolkit is still the Bible of Data Modeling … at least for now.
What happens if you build a Data Warehouse with facts and dimensions on top of Delta Lake? What is it? Kick the old Bronze, Silver, Gold nonsense to the curb.
Databricks would probably tell you it’s an incorrect Lake House, and Data Warehouse folk on SQL Server would probably tell you it’s a Data Warehouse gone wrong. Maybe we are all wrong. Or right. Maybe it’s the Twilight Zone.
Your data store is what you make it be. You’re the data people, you’re the Engineer. You can build what you want.
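To make that a little more concrete, here is a minimal sketch of the fact-and-dimension idea, using Python’s built-in sqlite3 purely as a stand-in for whatever engine you happen to run. The table names (dim_customer, fact_sales), the columns, and the rows are all made up for illustration; nothing here is specific to Delta Lake or any of the vendors above.

```python
import sqlite3

# In-memory database as a stand-in for "whatever storage you picked".
conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one dimension table and one fact table.
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL,
        region        TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id      INTEGER PRIMARY KEY,
        customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
        amount       REAL NOT NULL
    );
""")

# Made-up sample rows.
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "Acme", "East"), (2, "Globex", "West")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0)],
)


def sales_by_region(connection):
    """Join the fact table to its dimension and total sales per region."""
    return connection.execute("""
        SELECT d.region, SUM(f.amount) AS total
        FROM fact_sales AS f
        JOIN dim_customer AS d ON d.customer_key = f.customer_key
        GROUP BY d.region
        ORDER BY d.region
    """).fetchall()


totals = sales_by_region(conn)
for region, total in totals:
    print(region, total)
```

The same two tables and the same join would look about the same on an old RDBMS or on files in a bucket with a table format on top, which is rather the point: the modeling is yours, not the vendor’s.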
What it comes down to is that there is a battle for your data: companies and tools grinding their axes and eyeing each other, all vying for a chance to engulf your data with promises of never-ending sunshine and problem-free Data Engineering forever.
The Tooling Battle.
Redshift vs Snowflake vs BigQuery vs Databricks. Do you ever feel like a crazy person?? Why does it have to be so hard? Why does everything have to be so complicated? Why do people defend their tool of choice to the death?
I think we should cut through the layers of marketing and product detritus that have built up and just say out loud which tools are good for what.
Sure, you can make any tool work for you if you want; it’s amazing what Engineers will do simply to use what they want. They will absolutely get that square peg through a round hole one way or another.
But, say no more, let’s just tell people which tool to use for what and leave the rest for the rabble to squabble over.
Which Tool to Use For What?
Few things in life are as wonderful as making people mad by poking a stick in the eye of their golden calf. So let’s have a go.
Redshift
You should never use Redshift.
It’s expensive and you can equate it to an oversized and overpriced SQL Server.
If you are an AWS Shop and you’ve been using SQL Server, Oracle, or whatever, and you want to make your bosses happy by doing something cool … dump your data in Redshift and move on.
Snowflake
Snowflake is for dbt and SQL junkies who are like tweakers when they haven’t written a SQL query in the last 15 minutes.
For people who are bad programmers.
Don’t like doing ML stuff.
Like burning money.
BigQuery
Only goody-two-shoes and bamboozled ninnies who think GCP is the greatest use this tool.
It's old and never changes much.
Simply lacks 3/4 of the features of the other tools.
Will never have the market share of the other tools.
Warehousing tool for those who don’t know how to build real Warehouses.
Databricks
The best option to pick.
Unparalleled in features and options.
Machine Learning GOAT.
For good programmers.
SQL suckers can still use it.
Best in class features.
Well, did I make you mad? I’m probably not far off from the truth, at least my truth.
What can we boil it down to?
People can build whatever they want with whatever they want. Oftentimes it’s less about the tools and more about the Engineers, processes, and data.
Sure, at some point the data outgrows the old RDBMS and you have no choice. But, beyond that, do whatever you want. Just do it well and understand the tradeoffs you are making.