It’s the same old story since the beginning of time. Them vs. Us. The endless battles that rage on through the ages. I remember the days of Postgres vs. MySQL, SSIS vs. Data Services, and Kimball vs. Inmon; it just never ends … especially in tech.
I feel like the battle of this decade is Snowflake vs. Databricks. The endless benchmarks and nitpicking about query speeds, when honestly, most of the arguments miss the point of Data Engineering entirely.
We are going to solve this once and for all, set all things aright. No more wondering about which tool to use. Things are not all doom and gloom, always pitting this against that; leave that to the weak-minded Reddit genius perusing the putrid offscourings looking for their next victim.
You should check out Prefect, the sponsor of the newsletter this week! Prefect is a workflow orchestration tool that gives you observability across all of your data pipelines. Deploy your Python code in minutes with Prefect Cloud.
The tech battle of the decade.
Well, you might accuse me, rightly so, of being late to the argument. Better late than never. Both Snowflake and Databricks are touted for their extensive capabilities, yet they serve slightly different purposes in data pipelines and platforms, even if some overlap exists.
As a data engineer, diving deep into the comparative analysis of Databricks and Snowflake can equip you with the knowledge to make informed decisions that align with current or future project requirements.
I believe the truth about Databricks vs. Snowflake is shown in the figure above. It’s really what it all boils down to.
Databricks is for programming- and Machine Learning-centric workloads and teams, those living in the Data Lake or Lakehouse world. Snowflake is for the classic SQL comfort of a Data Warehouse team.
Yes, of course, they overlap, as with any set of tools. If you really want, you can pit them against each other and accomplish some of the same things by going different routes.
Can you do Machine Learning with Snowflake? Yes, if you want to. If your focus is ML should you use Databricks? Of course.
Can you build a dbt-based Data Warehouse in Databricks? Sure, if you want to. Should you use Snowflake to do this? Yes.
Databricks is a half-and-half solution, both ETL and Data Storage. Snowflake is more Data Storage, though it can handle ETL if everything is SQL-driven.
Think about it this way.
If you walk into a bar (GitHub repo) filled with Databricks users, what would you expect to find?
Lots of functional PySpark code.
Machine Learning Pipelines.
Some Delta Lake stuff.
Very programming-centric tasks.
If you walk into a bar (GitHub repo) filled with Snowflake users, what would you expect to find?
Lots of SQL.
Probably a tool like dbt.
Lots of Data Warehousing concepts.
So, should it always be Databricks vs. Snowflake? Probably not. What type of data team you are, or want to be, and the problems you are solving should dictate which tool you choose!
That being said, in case you live under a rock, or inside a SQL Server in some musty cubicle farm, let’s cover the Databricks and Snowflake basics. The “what they are.”
Databricks: The Unified Data Analytics Platform
Databricks, a unified data analytics platform, is designed to be collaborative and integrated, offering a multitude of services from data preparation to analytics and machine learning. Its collaborative notebooks feature facilitates seamless interaction among data scientists, engineers, and business analysts, making it a suitable choice for organizations looking to foster a collaborative culture closely tied to complex Machine Learning and Data Pipelines.
Key Features:
Unified Analytics: Databricks brings together data engineering, data science, and analytics on a single platform.
Delta Lake: The Delta Lake format brings ACID transactions and reliable, performant storage to data lakes.
Machine Learning: Offers an interactive workspace that enables users to build and train machine learning models effortlessly.
A very simple idea of what a Databricks workflow looks like.
Set Up Databricks Workspace:
Sign in to your Databricks workspace.
Create a new cluster: Choose a Databricks Runtime version that supports Delta Lake.
Create a Delta Table:
Create a new notebook in your Databricks workspace.
In the notebook, execute the following commands to create a Delta table:
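A minimal sketch, assuming a Databricks notebook where spark is already provided by the runtime; the events_delta table and its columns are made up for illustration:

```python
# Build a tiny sample DataFrame and save it as a managed Delta table.
# Table and column names are hypothetical; `spark` comes with the Databricks notebook.
from pyspark.sql import Row

events = spark.createDataFrame([
    Row(event_id=1, event_type="click", user_id=100),
    Row(event_id=2, event_type="view", user_id=101),
])

events.write.format("delta").mode("overwrite").saveAsTable("events_delta")
```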
Perform Operations on Delta Table:
You can now perform update, delete, and merge operations on your Delta table, operations that aren't possible on a plain Parquet table. Here's an example of an update operation:
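A minimal sketch using the DeltaTable API that ships with the Databricks Runtime, against the hypothetical events_delta table from above:

```python
# Update a single record in place -- something a plain Parquet table can't do.
from delta.tables import DeltaTable

delta_table = DeltaTable.forName(spark, "events_delta")

delta_table.update(
    condition="event_id = 1",          # SQL predicate selecting the rows to change
    set={"event_type": "'purchase'"},  # new value, expressed as a SQL string literal
)
```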
Query Delta Table:
You can use Spark SQL to query the Delta table as you would with any other table:
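Again a small sketch against the hypothetical events_delta table; display() renders the result right in the notebook:

```python
# Plain Spark SQL over the Delta table, just like any other table.
result = spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM events_delta
    GROUP BY event_type
""")

display(result)
```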
Visualize Data:
You can create visualizations of your data directly in Databricks. Just click on the bar chart icon in the toolbar above the cell with your command. This will automatically create a bar chart, but you can adjust the settings to create other types of visualizations.
Schedule Notebooks:
You can schedule your notebooks to run on a regular basis, right within the Databricks UI. This can be useful for updating your Delta tables with new data on a regular basis.
Snowflake: The Data Warehouse Built for the Cloud
On the other hand, Snowflake, a fully managed cloud data warehouse, provides a robust solution for data warehousing, data lakes, data engineering, data science, modern data sharing, and more. It's built to handle a multitude of data workloads via its unique multi-cluster, shared data architecture.
Key Features:
Performance: Snowflake’s architecture lets compute scale up and out on demand, keeping performance strong even under concurrent workloads.
Data Sharing: Enables seamless and secure data sharing with a simplified approach.
Zero Maintenance: Being fully managed, it requires essentially no infrastructure management or tuning, making it a hassle-free solution for data engineers.
What would an extremely basic dbt + Snowflake project look like?
Create a Directory Structure:
First, you'll need a directory for your dbt project. Inside it, create subdirectories for your models, analysis, and tests. Here's an example directory structure:
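Something like this, with my_dbt_project standing in as a placeholder project name:

```
my_dbt_project/
├── dbt_project.yml
├── models/
│   └── my_first_model.sql
├── analysis/
└── tests/
```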
Create a dbt Profile:
Create a profiles.yml file in the ~/.dbt/ directory with your Snowflake connection details:
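A minimal sketch; every account, credential, and object name below is a placeholder you would swap for your own:

```yaml
# ~/.dbt/profiles.yml
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account_identifier
      user: your_username
      password: your_password
      role: TRANSFORMER        # placeholder role
      warehouse: TRANSFORMING  # placeholder virtual warehouse
      database: ANALYTICS      # placeholder database
      schema: DBT_DEV          # placeholder schema
      threads: 4
```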
Create a dbt Project File:
Create a dbt_project.yml file in the root of your dbt project directory:
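A minimal sketch; the name and profile just need to match the placeholders used in profiles.yml above:

```yaml
# dbt_project.yml
name: my_dbt_project
version: "1.0.0"
config-version: 2

profile: my_dbt_project      # must match the profile name in profiles.yml

model-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]

models:
  my_dbt_project:
    +materialized: view      # build models as views by default
```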
Create a Model:
Inside your models directory, create a model file, for example, my_first_model.sql:
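A trivial example model; the raw.orders source table is hypothetical:

```sql
-- models/my_first_model.sql
select
    order_id,
    customer_id,
    order_total
from raw.orders
where order_total > 0
```

Run dbt run and dbt will compile this SQL and build it as a view in your Snowflake schema, ready to query.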
Thinking about it all.
I’m sure you can spot the subtle differences. Of course, my example for each case is simple and the reality can vary widely depending on the Data Team, but at the core/basic level, it does give a “feel” for what it’s like to work with Databricks or Snowflake.
In the end, they have some overlap because doing a thing with data is inherently open to interpretation, but I do believe that the business use case and the team dynamics are going to drive you to one of the two tools in an obvious manner.
The truth about Databricks vs. Snowflake is that they don’t really have to be pitted against each other. There are enough companies focused on being programming-centric with ML use cases. Just as there are many companies that are more SQL-centric with Data Warehousing at their core.
Not everything has to be a fight.