I usually like poking at things that are new and strange, and sometimes things that I am not familiar with. dbt is one of those things for me. Sure, I’ve played around with it ever so slightly in the past, but as someone who tries to write as little SQL as possible, I’ve pretty much avoided it.
It’s not that I have anything against SQL, it’s just that when you’ve done something for well over a decade, it becomes old hat and you just don’t want to do it anymore.
Either way, dbt is clearly here to stay for the long-haul, and I do have enough SparkSQL rattling around in Databricks Jobs that I’ve been curious about a dbt+Databricks combo. You know, what would it look like, is it worth it, etc.
So, not going to lie, today is going to be more for me than you. I’m curious what my reaction will be to a dbt+Databricks setup, seeing how all the Reddit rapscallions seem to worship at its feet (like they do with all things SQL related).
dbt for the uninitiated.
I can’t imagine there are very many people reading this who are unaware of dbt in all its glory, but here is the obligatory overview.
It’s simply impossible to escape the marketing drivel, isn’t it?
dbt (Data Build Tool) is an open-source analytics engineering tool that helps teams transform, model, and manage data within their data warehouse using SQL-based workflows. It is widely used in modern data stacks to ensure reproducible, version-controlled, and testable data transformations.
dbt saves us from unruly SQL; that’s what it comes down to. It also gives us testable and reusable SQL, bringing some order to the chaos, so to speak.
dbt + Databricks
I’m honestly just curious about this dbt+Databricks+SparkSQL thing and I don’t want to wait around anymore; I’d prefer to just jump in with both feet. My mother still says I’m too much of a hot-head.
When it comes to doing local development with dbt + Databricks, we apparently need the following things …
“dbt Core enabled as an OAuth application in your account. This is enabled by default.” - databricks
A personal access token - databricks
Ok, that’s a good sign. I don’t want to be throwing salt over my left shoulder and singing Stairway to Heaven backwards like I do with most other tools. Straight and to the point is how we like it.
If you are following along in your hymnal like your Grandma wants, it would be something like this …
uv init dbtbricks
cd dbtbricks
uv venv dbtbricks
source dbtbricks/bin/activate
uv pip install dbt-core dbt-databricks
After twiddling your thumbs for a minute, that will finish up. Next, we can set up a new dbt project.
(dbtbricks) danielbeach@Daniels-MacBook-Pro dbtbricks % dbt init --profiles-dir . bricks
05:54:37 Running with dbt=1.9.2
05:54:37
Your new dbt project "bricks" was created!
05:54:37 Setting up your profile.
Which database would you like to use?
[1] databricks
[2] spark
(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)
Enter a number: 1
host (yourorg.databricks.com): https://dbc-9a64f31c-25b9.cloud.databricks.com/
http_path (HTTP Path): sql/protocolv1/o/319592733000122/0125-214456-xrlnwni0
[1] use access token
Desired access token option (enter a number): 1
token (dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX):
[1] use Unity Catalog
[2] not use Unity Catalog
Desired unity catalog option (enter a number): 1
catalog (initial catalog): confessions
schema (default schema that dbt will build objects in): default
threads (1 or more) [1]: 1
Note what I needed to be able to complete this.
Host URI of the Databricks account
Personal Access Token
HTTP Path of the cluster (found under the cluster’s Advanced options, JDBC/ODBC tab)
Catalog name we are using for Unity Catalog
You can read more about the dbt-databricks adapter and connection options here. (In production you would most likely point dbt at a SQL Warehouse endpoint; I’m using a cluster I created for simplicity of testing.)
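For reference, those init answers land in a profiles.yml that looks roughly like this. This is a sketch from memory of the dbt-databricks profile format (token redacted); the top-level key matches the project name you gave dbt init.

bricks:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: confessions
      schema: default
      host: https://dbc-9a64f31c-25b9.cloud.databricks.com/
      http_path: sql/protocolv1/o/319592733000122/0125-214456-xrlnwni0
      token: dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX # personal access token
      threads: 1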
If you want to test that your new dbt setup is good to go, including the connection to Databricks, run this …
>>dbt debug
19:28:52 Running with dbt=1.9.2
19:28:52 dbt version: 1.9.2
19:28:52 python version: 3.9.21
19:28:52 python path: /Users/danielbeach/code/dbtbricks/dbtbricks/bin/python3
19:28:52 os info: macOS-15.3.1-x86_64-i386-64bit
/Users/danielbeach/code/dbtbricks/dbtbricks/lib/python3.9/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'allow_population_by_field_name' has been renamed to 'populate_by_name'
warnings.warn(message, UserWarning)
19:28:54 Using profiles dir at /Users/danielbeach/code/dbtbricks/bricks
19:28:54 Using profiles.yml file at /Users/danielbeach/code/dbtbricks/bricks/profiles.yml
19:28:54 Using dbt_project.yml file at /Users/danielbeach/code/dbtbricks/bricks/dbt_project.yml
19:28:54 adapter type: databricks
19:28:54 adapter version: 1.9.7
19:28:54 Configuration:
19:28:54 profiles.yml file [OK found and valid]
19:28:54 dbt_project.yml file [OK found and valid]
19:28:54 Required dependencies:
19:28:55 - git [OK found]
19:28:55 Connection:
19:28:55 host: https://dbc-9a64f31c-25b9.cloud.databricks.com/
19:28:55 http_path: sql/protocolv1/o/319592733000122/0125-214456-xrlnwni0
19:28:55 catalog: confessions
19:28:55 schema: default
19:28:55 Registered adapter: databricks=1.9.7
19:29:08 Connection test: [OK connection ok]
19:29:08 All checks passed!
Running that command will tell you if you have any problems with your new setup.
Writing dbt models on Databricks.
This isn’t supposed to be a tutorial about dbt itself, although if you are new to it, it may serve the purpose of giving you a high-level overview. I did want to mention some “things” about how dbt works, in the context of Databricks, because I think there are some topics worth talking about.
A word on how your models are materialized … or not.
Materializations are how one would persist dbt models in a warehouse. There are five types of materializations built into dbt. By default, dbt models are materialized as "views".
They are …
table: your model is rebuilt as a table on each run.
view: your model is rebuilt as a view on each run.
incremental: inserts or updates records into a table since the last time that model was run.
ephemeral: not directly built into the database, aka a CTE.
materialized view: a combination of a view and a table.
In my humble opinion this is a very important distinction when running dbt, especially in a Databricks context where you may be working within a Lake House holding hundreds of terabytes of data.
Do you really want every single model materialized as a table and persisted simply because someone wasn’t paying attention? Your storage costs would explode.
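One simple guardrail here (a sketch, not something from the project in this post): set a cheap default materialization at the project level in dbt_project.yml and only opt specific folders or models into heavier materializations. The marts folder below is hypothetical.

# dbt_project.yml (sketch)
models:
  bricks:
    +materialized: view   # cheap default, nothing persisted by accident
    marts:                # hypothetical subfolder of models/ that we do want persisted
      +materialized: table
      +file_format: delta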
We should move on to actually writing a model on Databricks. Note: there are many ways to run dbt jobs on Databricks; we can talk about those later. Let’s get inside our models directory (for dbt) and make something.
(dbtbricks) danielbeach@Daniels-MacBook-Pro bricks % ls
README.md dbt_project.yml macros profiles.yml snapshots
analyses logs models seeds tests
I’m going to skip a lot of best practices here (we can talk about them later); I just want to see generally what a dbt model looks like for Databricks. I have a pre-existing Delta Lake Unity Catalog table with Backblaze failure information about hard drives in a data center.
Let’s just write a simple model that rolls up failures by date and model.
{{
  config(
    materialized = "table",
    file_format = "delta"
  )
}}

-- roll up Backblaze drive failures by date and drive model
with source_data as (
    select date, model, count(failure) as failure_count
    from confessions.default.hard_drive_failures
    group by date, model
)

select *
from source_data
Doing a simple run …
(dbtbricks) danielbeach@Daniels-MacBook-Pro bricks % dbt run --model models/hardware.sql
22:14:32 Running with dbt=1.9.2
22:14:34 Registered adapter: databricks=1.9.7
22:14:35 Found 3 models, 4 data tests, 607 macros
22:14:35
22:14:35 Concurrency: 1 threads (target='dev')
22:14:35
22:14:37 1 of 1 START sql table model default.hardware .................................. [RUN]
22:14:44 1 of 1 OK created sql table model default.hardware ............................. [OK in 7.63s]
22:14:45
22:14:45 Finished running 1 table model in 0 hours 0 minutes and 9.57 seconds (9.57s).
22:14:45
22:14:45 Completed successfully
22:14:45
22:14:45 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
We can see that a new Unity Catalog Delta Table has been created that matches our model. Pretty much works as expected.
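As an aside, to tie this back to the materialization discussion: if this rollup grew large, an incremental version of the same model might look roughly like the sketch below. The merge strategy, unique_key, and date filter are my own assumptions, not something I ran above.

{{
  config(
    materialized = "incremental",
    file_format = "delta",
    incremental_strategy = "merge",
    unique_key = ["date", "model"]
  )
}}

-- on incremental runs, only scan dates newer than what already exists in the target table
select date, model, count(failure) as failure_count
from confessions.default.hard_drive_failures
{% if is_incremental() %}
where date > (select max(date) from {{ this }})
{% endif %}
group by date, model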
Yeah, but what about production workloads?
Like most popular tools with plenty of usage, actually putting dbt into production on Databricks would look a little different, but you have enough options for whatever suits you.
You can easily …
put your dbt models in a git repo, which is cloned and sync’d to Databricks
run Databricks Jobs (Notebooks, or other code)
There is nothing special about this; most people who use Databricks know how, and already have systems in place, to run Databricks Jobs. Integrating dbt models into this workflow would be very straightforward.
If you want to see examples of this, Databricks has a few you can see here; again, nothing special if you’re used to being in the Databricks environment.
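For a rough sense of what the Databricks Jobs route looks like, here is a sketch of a Jobs API payload with a dbt task. The field names are from my recollection of the Jobs 2.1 API, and the repo URL, cluster id, and warehouse id are placeholders, so treat this as an approximation and check the Databricks docs before copying it.

{
  "name": "dbt-bricks-nightly",
  "git_source": {
    "git_url": "https://github.com/your-org/bricks.git",
    "git_provider": "gitHub",
    "git_branch": "main"
  },
  "tasks": [
    {
      "task_key": "dbt_run",
      "dbt_task": {
        "project_directory": "",
        "commands": ["dbt deps", "dbt run"],
        "warehouse_id": "<sql-warehouse-id>"
      },
      "existing_cluster_id": "<cluster-id>",
      "libraries": [{"pypi": {"package": "dbt-databricks"}}]
    }
  ]
}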
Thinking about dbt on Databricks at a high level.
While dbt itself is popular, I want to spend some time thinking about the combination of dbt with Databricks, and why a Data Team might choose to go down this path, because personally, I find this the more interesting question.
Firstly, from everything I’ve seen, dbt and Databricks have first-class support for each other. You will have zero problems, on the surface it seems, integrating the two.
But again, why would you use it (dbt, that is)?
you get modular SQL
you get easy testing (a quick sketch of both follows this list)
???
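For concreteness, here is roughly what those two bullets look like in a dbt project. The downstream model and the tests below are illustrative sketches, not something from the project above.

-- models/worst_drives.sql (hypothetical downstream model; ref() is what makes the SQL modular)
select model, sum(failure_count) as total_failures
from {{ ref('hardware') }}
group by model
order by total_failures desc

# models/schema.yml (sketch): declarative tests that run with `dbt test`
version: 2
models:
  - name: hardware
    columns:
      - name: failure_count
        tests:
          - not_null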
What I’m trying to say is that you can achieve these things without dbt easily enough; in fact, I would say the majority of serious Databricks users DO indeed get any number of the “advantages” you might list for dbt while NOT using dbt.
I’m not bashing dbt, I’m just making a point. Why would a Data Team choose to use dbt as their primary data transformation tool on Databricks?
Because they are a SQL-only team.
I’ve mostly seen two types of Databricks users: those who do more advanced things with Spark, like Machine Learning, and those who use it as simple Lake House tooling to do basic transformations.
advanced users, like Machine Learning pipelines etc.
basic data transformation (Lake House, Data Warehouse type models)
This first group probably uses mostly PySpark or Scala DataFrame API based code, writing modular, reusable, and unit-tested methods and libraries. This is very common, and this is actually the type of work I do on a daily basis.
The second group are teams that do mostly tabular data transformations and are simply more comfortable writing SQL rather than being more “programming oriented.” Also, as far as SQL goes, the bar is lower and it’s easier to build, recruit, and maintain teams that primarily use SQL.
dbt has an almost irresistible appeal for SQL-based teams. It provides what SQL hasn’t given us for decades and what teams have struggled with forever.
A modular system, including testing, that brings Software Engineering level best practices to Data Teams that have long been devoid of them.
Should you use dbt on Databricks? If you are a SQL-based team and 50%+ of your pipelines are written in SQL, then you are doing yourself a disservice by NOT using dbt.
In my case I already have 90%+ DataFrame code that is modular, unit tested, etc.; I would gain no benefit from adopting dbt and re-writing everything in SQL. Many of the Machine Learning pipelines I work on simply don’t fit the SQL paradigm well.
I can say I am pleased with the simplicity of using dbt on Databricks; it honestly couldn’t be any easier, like, at all.