Simplicity is the name of the game. I know, some engineers just can’t get over their love affair with complexity. It’s like a drug: easy to take and hard to get off of. But we all know better in the end; our sins will find us out.
Death to complexity. Long live simplicity.
Today we are going to talk about simplifying Lake House workflows with two Delta Lake + Spark features, COPY INTO and MERGE, and how they fit into the overall data pipeline picture.
You can build an entire stack of data pipelines on the back of these two, each an Atlas … holding the proverbial Lake House world on its shoulders. How do I know this? Because I’ve done it … like in real life.
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Check out their website below.
Introduction to Delta Lake
Before we dive into the specifics, let’s briefly recap what Delta Lake is. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and other big data workloads. It’s the new and upgraded Data Lake.
Delta Lake gives us the power of DDL, DML, ACID, CRUD, and all the rest … at a massive scale. It’s not compute, it’s storage and data representation.
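To make that concrete, here is a minimal sketch, with hypothetical table and column names, of the kind of DDL and DML Delta Lake lets you run directly against the storage layer:

```sql
-- DDL: define a Delta table (all names here are hypothetical).
CREATE TABLE lake_house.orders (
  order_id BIGINT,
  order_ts TIMESTAMP,
  amount   DECIMAL(10, 2)
) USING DELTA;

-- DML with ACID guarantees: an in-place update on data lake files,
-- something plain Parquet-on-s3 can't do safely.
UPDATE lake_house.orders
SET amount = 0
WHERE order_id = 42;
```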
Thinking about Data Pipelines with Delta Lake
It’s common, when engineering Lake Houses that hold many hundreds of TBs or more of data, for our bits and bytes to pass through different data pipelines and stages … and to be stored at each stage in some representation that is meaningful to the business or to the Engineers.
This creates a very close relationship between our …
ETL and Data Pipelines
Storage Layer
Compute
These three can now be closely knit together if so desired. Some might complain about the loss of separation of concerns, and that is a fair point, but simplicity at scale is just as important.
This is where today’s two topics, MERGE and COPY INTO, come in: they enable simple data pipelines in single statements that can, in themselves, support complex use cases. We have to pick one to start with, so let’s begin with COPY INTO.
The Power of COPY INTO
Consider a simple and common use case, especially for Lake Houses: say we have an s3 bucket that gets filled with CSV files daily or hourly.
We need to get that data out of its raw format and into our Lake House for downstream processing, all while doing some simple processing along the way. Our source is s3; our target is a “staging” Delta Table.
Well, in the old days that would probably require a bunch of steps like read, transform, and write. Delta Lake’s COPY INTO does this with a single statement. Including idempotency!!!
Below is an example of doing just that.
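This sketch assumes a hypothetical bucket (s3://my-raw-bucket/orders/) and staging table; the shape of the command is what matters:

```sql
-- Idempotently load new CSV files from s3 into a staging Delta table,
-- applying light transformations on the way in (all names hypothetical).
COPY INTO lake_house.staging_orders
FROM (
  SELECT
    order_id,
    CAST(order_ts AS TIMESTAMP) AS order_ts,                    -- cast string to timestamp
    sha2(CONCAT_WS('|', order_id, order_ts), 256) AS row_hash,  -- create a row hash
    amount
  FROM 's3://my-raw-bucket/orders/'
)
FILEFORMAT = CSV
PATTERN = '*.csv'                      -- only pick up files matching this pattern
FORMAT_OPTIONS ('header' = 'true');
```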
This is one of the best-kept secrets of Data Pipelines with Delta Lake. The operation is idempotent, meaning it will only copy a new file once, no matter how many times you re-run it. Of course, what you see above is only a small part of the configuration that can take place.
You can read more in the documentation here. Above, you can see all of the following steps happening in a single command:
Copying any new files from source to target.
Doing transforms like casting to timestamps and creating hashes, etc.
Loading data from a remote cloud bucket with pattern matching.
Imagine the possibilities!! The COPY INTO command is perfect for staging type loads of raw data going into a Lake House. It can be idempotent, it can apply transformations, it can pattern match, and it can be an all-in-one load. Simple!
The Power of MERGE
MERGE, often called upsert (update + insert), allows you to merge a source table with a target Delta table. With MERGE, you can:
Insert new records
Update existing records
Delete records that match given conditions
This is a great feature, as you can even tackle SCD (Slowly Changing Dimension) / history-tracking tables with the INSERT, UPDATE, and DELETE capabilities of a single MERGE statement.
Let's take a look at an example:
Scenario: We have a products Delta table that we want to update based on the new inventory data coming in.
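Here is a sketch of what that MERGE could look like, assuming a hypothetical new_inventory source table with product_id, product_name, and quantity columns:

```sql
-- Upsert incoming inventory rows into the products Delta table.
MERGE INTO products AS target
USING new_inventory AS source
  ON target.product_id = source.product_id
WHEN MATCHED THEN
  UPDATE SET target.quantity = source.quantity   -- product exists: update quantity
WHEN NOT MATCHED THEN
  INSERT (product_id, product_name, quantity)    -- new product: insert it
  VALUES (source.product_id, source.product_name, source.quantity);
```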
In the above SQL code:
If the product exists, it updates the quantity.
If the product doesn’t exist, it inserts the new product into the products table.
There are very few use cases, when moving data between the logical storage layers of a Lake House, that can’t fit into the concept of the MERGE statement.
Now … think about combining both COPY INTO and MERGE into a single pipeline.
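As a rough sketch, again with hypothetical names, the whole pipeline can be just two statements: COPY INTO lands raw files idempotently in staging, and MERGE upserts the staged rows into the curated table.

```sql
-- Step 1: idempotently land any new raw files in a staging table.
COPY INTO lake_house.staging_orders
FROM 's3://my-raw-bucket/orders/'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('header' = 'true');

-- Step 2: upsert the staged rows into the curated table.
MERGE INTO lake_house.orders AS target
USING lake_house.staging_orders AS source
  ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```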
Benefits of Using MERGE and COPY INTO in Delta Lake
Efficiency: Both commands are optimized for performance. Rather than writing custom logic, these commands handle most use cases with a simple SQL statement.
Simplified Data Operations: No need for complex logic. Just express what you want to do in a simple SQL statement.
Wrapping Up
Delta Lake has transformed the world of data engineering by bringing ACID transactions, scalability, and simplicity to big data workloads. Commands like MERGE and COPY INTO further simplify data operations, ensuring that data engineers can focus on providing value rather than wrangling with complex operations.
Give them a try in your next project and experience the difference!