Ever since the days of yore, SQL has been the bread and butter of everyone from the old school DBAs to the new school Data Engineers. Of course, Python is probably the bread bag if SQL is the sandwich, or is it the other way around?
There have been many attempts to unseat both SQL and Python over the years … Scala, Golang, and Rust have all called for the death of Python, and GraphQL and REST APIs were supposed to eliminate the need for SQL. Yeah right, dream on.
So, if we are doomed to use SQL and Python forever, then that begs a different question …
When should you use Python and when should you use SQL in your data pipelines?
You should check out Prefect, the sponsor of the newsletter this week! Prefect is a workflow orchestration tool that gives you observability across all of your data pipelines. Deploy your Python code in minutes with Prefect Cloud.
To SQL or not to SQL, or should I use Python?
I imagine the decision of when and where to use Python and SQL probably starts out very straightforward. However, as time passes and the codebase grows, things probably start to become less clear.
At some point, every Data Team is probably faced with a crisis of identity. Are we a SQL shop? Are we a code shop? WHAT ARE WE??!!
We probably start with good intentions, trying to drive that car straight down the center of the road, but over time, as we get busy writing code and adding new features, things start to warp. Maybe it even depends on the person doing the work that day.
What happens is you wake up a few years later and realize that …
My code base is 80%+ SQL.
My code base is 80%+ Python.
I mean we could simply ask the question … is there a right and wrong answer here?? Probably there is.
Striking the correct balance.
I’m a fan of all things in moderation. I think this applies to the question of Python or SQL for data pipelines. The answer should be both, use the right tool at the right time, and stay away from the slippery slope where you become a one-trick pony.
This idea of getting sucked into the ditch of using too much of a single tool is bad for the business, the developers, and the culture.
You want to be flexible, with good data pipelines that are easy to use, and able to add new features without incurring mountains of technical debt. That requires making good decisions upfront.
The problem is, if you hitch your wagon to either SQL or Python wholesale, you will eventually find yourself with an inflexible mess on your hands.
The difference between a matter of taste and good technical decisions.
Instead of pontificating more, let’s actually look at real-life code and examples. Which decisions are good technical choices from an engineering perspective, and which are merely a matter of taste?
It can be hard to summarize this into specific rules, so it probably has to go on a case-by-case basis.
PySpark vs SparkSQL example.
A great example, and a very popular one today thanks to the Databricks platform, is the choice between SparkSQL and the PySpark Dataframe API. The two codebases would look completely different, wouldn’t they?
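To make it concrete, here’s a rough sketch of what the Dataframe API side might look like (the S3 paths, columns, and logic are hypothetical, invented purely for illustration):

```python
# A rough sketch of a Dataframe API pipeline (paths, views, and column
# names are hypothetical, invented purely for illustration).
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("sales_rollup").getOrCreate()

def read_sales(path: str) -> DataFrame:
    """Read raw sales records from CSV files."""
    return spark.read.csv(path, header=True, inferSchema=True)

def rollup_sales_by_customer(df: DataFrame) -> DataFrame:
    """Aggregate total and average sale amount per customer."""
    return df.groupBy("customer_id").agg(
        F.sum("sale_amount").alias("total_sales"),
        F.avg("sale_amount").alias("avg_sale"),
    )

sales = read_sales("s3://my-bucket/sales/*.csv")
rollup_sales_by_customer(sales).write.mode("overwrite").parquet(
    "s3://my-bucket/analytics/sales_rollup/"
)
```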
And what does the same code look like in SQL?
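Something like this, with the same caveat that all the names are made up. Note how the entire transformation collapses into one statement:

```python
# The same logic as a single SparkSQL statement (same hypothetical names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales_rollup_sql").getOrCreate()

# Register the raw CSV data as a temp view, then do all the work in one query.
spark.read.csv("s3://my-bucket/sales/*.csv", header=True, inferSchema=True) \
    .createOrReplaceTempView("sales")

result = spark.sql("""
    SELECT
        customer_id,
        SUM(sale_amount) AS total_sales,
        AVG(sale_amount) AS avg_sale
    FROM sales
    GROUP BY customer_id
""")
result.write.mode("overwrite").parquet("s3://my-bucket/analytics/sales_rollup/")
```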
I mean what can we say about these two pipelines concretely? They both have functions that are fairly small and straightforward to unit test, which is key to a good codebase.
Is one more readable than the other? That’s probably just a matter of taste.
I suppose it’s worth noting that the Python code in the first example is broken up slightly more than the SQL code. That is, there are two functions getting the entire job done, whereas with the SQL code this is all rolled into a single SQL statement.
There is good and bad to that.
Most good engineers agree that having less code overall reduces bugs etc.
Most good engineers agree rolling too much logic into a single place is probably not a good idea.
So again, we come to an impasse. Technically, the problems above could exist in either codebase; it really comes down to the team of developers working on the code.
But, is the coagulation of large pieces of code into giant monstrosities more common in SQL or Python? I will leave that to you smart folk to answer.
Thinking about the work being done …
I think, more often than not, the decision to use more Python or SQL comes down to the type of data you are working with and the business requirements that drive the inherent complexity of the tasks.
If you’re working with tabular data, either in an RDBMS or coming from an RDBMS, with fairly normal business requirements and analytics … it probably makes sense to stick with SQL.
If you’re working with semi-structured data, or even tabular data that requires complex transformations, anything in the ML space, APIs, or an extreme reliance on testing for mission-critical data … you might be better off with Python.
At this point, some people might be asking “Why don’t you just look to the strengths of the data team, what they are comfortable with?”
Well, that is a slippery slope. We should let the technical requirements and good sound engineering decisions drive the tools we pick for our pipelines … not “convenience” or “because I’m used to this more.” That’s how bad things start to happen.
Just because you COULD do something in SQL or Python doesn’t mean you SHOULD.
Think about it …
If you have two large datasets in CSV format in s3 … one is `sales` information, the other is `customer` information and you get asked to build out a dataset that is the combination of these two datasets with some rolled-up analysis done …
Should you pick Python? Of course not!
This problem is best modeled and solved with SQL. It’s what SQL was made for … joins and rollups. Therefore this solution in SQL will be the least amount of code, easiest to understand, and the best solution long term.
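As a hedged sketch, assuming hypothetical bucket names and columns, the whole job could be expressed in one query via SparkSQL:

```python
# Hedged sketch: join the two S3 datasets and roll them up in one query.
# Bucket, paths, and column names are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer_sales_rollup").getOrCreate()

# Register both CSV datasets as temp views.
for name in ("sales", "customer"):
    spark.read.csv(f"s3://my-bucket/{name}/*.csv", header=True, inferSchema=True) \
        .createOrReplaceTempView(name)

combined = spark.sql("""
    SELECT
        c.customer_id,
        c.region,
        COUNT(*)           AS order_count,
        SUM(s.sale_amount) AS total_sales
    FROM sales AS s
    JOIN customer AS c
      ON s.customer_id = c.customer_id
    GROUP BY c.customer_id, c.region
""")
combined.write.mode("overwrite").parquet("s3://my-bucket/analytics/customer_sales/")
```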
Let’s say we are tasked with doing some feature engineering for a new ML model. Why in the world would you choose SQL? Feature engineering is an intricate problem with different transformations for different features, each requiring multiple unit tests to verify accuracy, etc.
Go with Python of course.
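For illustration, here’s a sketch of what that might look like: small, pure transformation functions, each with its own unit test (the features, buckets, and thresholds are invented):

```python
# Sketch of feature engineering as small, pure, unit-testable functions
# (features, buckets, and thresholds are all made up for illustration).
import numpy as np
import pandas as pd

def bucket_age(age: pd.Series) -> pd.Series:
    """Bin raw ages into coarse categorical buckets for the model."""
    return pd.cut(
        age,
        bins=[0, 18, 35, 55, 120],
        labels=["minor", "young", "middle", "senior"],
    )

def log_scale_income(income: pd.Series) -> pd.Series:
    """Log-transform a heavily skewed income feature."""
    return np.log1p(income.clip(lower=0))

# Each transformation gets its own pytest-style unit test.
def test_bucket_age():
    assert list(bucket_age(pd.Series([10, 25, 70]))) == ["minor", "young", "senior"]

def test_log_scale_income_handles_negatives():
    assert log_scale_income(pd.Series([-5.0])).iloc[0] == 0.0
```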
What about the stuff that isn’t so obvious?
Of course, not every decision is this easy. Sometimes it’s the thin grey lines that give us the most trouble. My suggestion is simply to be a good engineer.
Examine the problem closely, maybe do a POC, what feels better, SQL or Python?
Look at your tech stack closely, does one solution fit in better than the other?
Examine your codebase, is it becoming too lopsided? Have you crossed the line of too much of a good thing?
I think it’s extremely important that, as Data Engineers, we overcome our biases and our tendency to stay comfortable. We should challenge ourselves to make good engineering decisions while thinking critically about the problem and the proposed solution.
Let’s take a moment to speak out loud about the obvious pitfalls of each approach. Honesty is good practice.
What to be careful about with SQL pipelines.
Since SQL is probably the most common tool used in Data Engineering pipelines, we should give a warning about the pitfalls to avoid when you find yourself slipping into the deep end of the pool.
Allowing spaghetti SQL queries to get out of control.
Not having any of the SQL queries tested (unit testing).
Allowing the same business logic to get “spread around.”
Not having good standards (dbt, etc.) on HOW you expect your SQL to be written.
Not writing idempotent data pipelines and queries.
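That last point deserves a concrete illustration. One common idempotent pattern is delete-then-insert scoped to the run’s date, so re-running a given day never duplicates rows (a sketch, assuming a platform that supports DELETE, such as Delta Lake or Snowflake; all table and column names are hypothetical):

```python
# A common idempotent pattern: delete-then-insert scoped to one run date,
# so re-running the pipeline for a given day never duplicates rows.
# Assumes a table format that supports DELETE (Delta Lake, Snowflake, etc.);
# all table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
run_date = "2023-01-15"

spark.sql(f"DELETE FROM analytics.daily_sales WHERE sale_date = '{run_date}'")
spark.sql(f"""
    INSERT INTO analytics.daily_sales
    SELECT
        sale_date,
        customer_id,
        SUM(sale_amount) AS total_sales
    FROM raw.sales
    WHERE sale_date = '{run_date}'
    GROUP BY sale_date, customer_id
""")
```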
When we become comfortable with anything, it is easy to let stuff go bad. Anyone who’s been around the block a time or two knows that out-of-control SQL queries are very commonplace.
Too much logic in one place, spaghetti SQL queries, and the same logic spread around and not being reused. These are very common issues that lead to serious tech debt and will set the stage for the degradation of the data platform as a whole.
What to be careful about with Python pipelines.
SQL isn’t the only one with problems; Python data pipelines have their own set of gotchas that sneak up on the best of engineers.
Not having unit tests to cover logic (see the sketch after this list).
Not following a coding standard (code looks like anything and everything).
Not writing clean and functional code (methods with 50+ lines of code).
Not managing dependencies well.
Not having a good development environment.
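On the unit-testing point in particular, a little discipline goes a long way: keep the logic in small, pure functions and test them without standing up any infrastructure. A minimal pytest-style sketch, with made-up names:

```python
# Minimal sketch: keep pipeline logic in small, pure functions so it can be
# unit tested without standing up any infrastructure (names are made up).
import pytest

def normalize_currency(amount: float, rate: float) -> float:
    """Convert an amount to the reporting currency at a given exchange rate."""
    if rate <= 0:
        raise ValueError("exchange rate must be positive")
    return round(amount * rate, 2)

def test_normalize_currency():
    assert normalize_currency(10.0, 1.5) == 15.0

def test_normalize_currency_rejects_bad_rate():
    with pytest.raises(ValueError):
        normalize_currency(10.0, 0)
```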
One of the hardest parts about coming into a Python-centric data platform is that you never know what you are going to find. When it comes to SQL, it might be dirty, but you will figure it out. But, when it comes to Python, it can get insanely dirty and impossible to debug quickly if “good clean code practices” are not followed.
Closing Thoughts
When it comes to SQL or Python for data pipelines, I vote for both. Both in the right place at the right time. Also, I think it’s easier than most people think to know when to use what.
Many times, the problem being solved dictates what tool to use. It’s also important for teams to understand which way their biases lean and to make sure they don’t fall into the ditch on one side or the other.
I’ve seen my fair share of horrible SQL and Python repositories. I’m convinced it isn’t about which tool you use, but HOW you decide to approach using that tool. It’s about the Engineers who make choices every day as to what their solutions will look like.
You can write clean SQL and Python data pipelines; you can also write very bad ones.