If you’re anything like me after years of the same old same old, you can get lulled into a dull stupor of SQL and Data Types that never seem to end.
» String
» Int
» String
» ohhh … look at that a FLOAT!
We wizened old Data Engineers who’ve been encrusting ourselves with the same old Data Types since the days when SQL Server and Oracle ruled the world get a little bleary-eyed when it comes to something new.
You should check out Prefect, the sponsor of the newsletter this week! Prefect is a workflow orchestration tool that gives you observability across all of your data pipelines. Deploy your Python code in minutes with Prefect Cloud.
We think a FLOAT is exciting. Wait until MAP and ARRAY data types hit the scene. It’s like a high school keg party behind the old oak tree at midnight. Things are about to get crazy.
Delta Lake is one of those amazing and interesting tools that people don’t appreciate enough. 15 years ago, when we were all using Postgres or storing Parquet files on s3, who would have thought that something like Delta Lake could exist?
ACID, CRUD, and SQL, all through an API on top of Parquet files in the cloud. Amazing. Combining Delta Lake with Dataframe libraries like PySpark or Polars has become the new standard for Lake House and even Data Warehouse modeling.
In an effort to expand the horizons for you Data Engineers who are stuck in the doldrums, today we will take a look at two Data Types available in Delta Lake, the lesser-worn road that only the brave dare walk.
Array and Map Data Types
Let’s talk about these two underused Data Types and see what we think.
» ARRAY<elementType> - Represents values comprising a sequence of elements with the type of elementType.
» MAP<keyType, valueType> - Represents values comprising a set of key-value pairs.
Most people are probably more familiar with the concept of Arrays and Maps from programming, not so much as storage data types. I think that creativity is one of those rare traits that make for really great Engineers. The ability to think outside the box, being able to see the solutions that are not that obvious.
This is probably why you don’t run into Map and Array much in the wild. They are simply forgotten. Too often, there is no creativity applied when examining problems and finding solutions.
Let’s look at a quick example of what these data types look like, just to get an idea.
>> SELECT ARRAY("Old", "Grandma", "Huckleberry") as WhosYourGrandma;
["Old", "Grandma", "Huckleberry"]
Ok, so that probably looks familiar to most programmers, the “list” type syntax of `[]`.
I mean, how else does it act like an Array?
>> SELECT ARRAY("Old", "Grandma", "Huckleberry")[0] as WhosYourGrandma;
Old
Interesting, I guess we can slice and dice with indexes as we would expect. What else?
>> SELECT EXPLODE(ARRAY("Old", "Grandma", "Huckleberry")) as WhosYourGrandma;
WhosYourGrandma
Old
Grandma
Huckleberry
Not bad. Pretty creative for messing around in SQL, huh?
What about Maps?
>> SELECT map('Hobbits', 'Shire', 'Orc', 'Mordor');
{"Hobbits": "Shire", "Orc": "Mordor"}
All right, all right, the classic dictionary.
>> SELECT lotr['Hobbits'] FROM VALUES(map('Hobbits', 'Shire', 'Orc', 'Mordor')) AS T(lotr);
Shire
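If you want to pull a Map apart the same way we EXPLODED that Array, Spark SQL also ships built-ins like map_keys() and map_values(), and explode() works on Maps too, spitting out a key column and a value column. A quick sketch in the same spirit (the column aliases here are just made up for the example):
>> SELECT map_keys(map('Hobbits', 'Shire', 'Orc', 'Mordor')) as WhosAskin;
["Hobbits", "Orc"]
>> SELECT EXPLODE(map('Hobbits', 'Shire', 'Orc', 'Mordor')) as (folk, homeland);
folk | homeland
Hobbits | Shire
Orc | Mordor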
Not bad. Seems like it could get complicated quickly. Sometimes you need complicated though.
Real World Stuff
All this raises the question of how we could use Array and Map in the real world of Data Lakes for Data Engineering. Of course, this wouldn’t be the “typical” everyday use case, that is for sure, but where it is used, it’s probably a great innovation.
When we talk about Data Types we should always be talking about Data Modeling.
Probably, most of the use cases for Array and Map data types come from two sources …
The handling of semi-structured data
Being creative at the TB+ level with access and storing data
If you’re just used to playing with CSV and tabular data, most likely you don’t have much use for these data types. But, if you use or ingest semi-structured data like JSON, or you have many TBs of data, you need to be “creative” with Data Modeling because it matters at scale.
For example, say we are ingesting JSON-type values into a Data Lake …
customer_id | customer_name | customer_address_1_2
325234 | Billdad | {'address_1': '123 Boring Street', 'address_2': '456 Ding Dong Road'}
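Here’s a rough sketch of what that could look like as an actual Delta table, assuming you’re running Spark SQL with Delta Lake available (the table and column names are just made up for illustration):
>> CREATE TABLE customers (
     customer_id BIGINT,
     customer_name STRING,
     customer_addresses MAP<STRING, STRING>  -- both addresses live in one column
   ) USING DELTA;

>> INSERT INTO customers VALUES
   (325234, 'Billdad', map('address_1', '123 Boring Street', 'address_2', '456 Ding Dong Road'));

>> SELECT customer_addresses['address_1'] as first_address
   FROM customers
   WHERE customer_id = 325234;
123 Boring Street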
This can be done for simplicity's sake, keeping related data close together. It’s just another way to view and interact with the data.
customer_id | customer_name | last_6_monthly_sales
325234 | Billdad | [500, 800, 700, 200, 100, 350]
For example, in the table above, you will note that the last 6 months of sales are stored ALONG WITH the customer record. This can be extremely helpful at scale.
It may not make sense if you’re storing data in Postgres, but when you have 300TBs of data in a data lake, expressive data modeling like this can have a major impact on analysis and analytics, down to the bottom line, saving tons of compute and money over the long term.
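As a rough sketch (again, the names and numbers are invented for the example), that sales history could live in a Delta table like the one below, and Spark SQL’s higher-order functions like aggregate() let you work with the array without ever exploding it or joining to a second table:
>> CREATE TABLE customer_sales (
     customer_id BIGINT,
     customer_name STRING,
     last_6_monthly_sales ARRAY<INT>
   ) USING DELTA;

>> INSERT INTO customer_sales VALUES (325234, 'Billdad', array(500, 800, 700, 200, 100, 350));

>> SELECT customer_name,
          last_6_monthly_sales[0] as most_recent_month,
          aggregate(last_6_monthly_sales, 0, (acc, x) -> acc + x) as six_month_total
   FROM customer_sales;
Billdad | 500 | 2650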
For example, read this excerpt from Zach Wilson, a Staff Data Engineer and very popular creator.
“ARRAY and STRUCT data types can be a slam dunk for representing very high cardinality data in a compact way. Imagine you needed to represent the next 90 days of predicted prices.
You could have a schema like:
id BIGINT
predicted_price DECIMAL
probability_of_purchase DECIMAL
predict_date STRING
ds STRING
The problem with this is you’ll have a lot of rows as the data grows (90 X number of distinct ids). If that number is in the tens of millions to billions, this schema breaks down since you’d have 90 billion records if you have a billion ids.
Imagine an efficient schema like this
id BIGINT
predicted_prices_next_90d ARRAY(STRUCT(probability_of_purchase DECIMAL, price DECIMAL))
ds STRING
The array index of predicted_prices_next_90d is the number of days after ds the prediction is for. If you use this schema instead you have 1 billion records and the data volume is dramatically less.
Definitely check stuff like this out. I think it’s the future of data modeling.”
This example shows perfectly the power of out-of-the-box thinking and data modeling with Array and Map type objects.
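For the curious, here’s a minimal sketch of what Zach’s compact schema might look like as a Delta table in Spark SQL (the table name, DECIMAL precisions, and date filter are my own inventions for illustration):
>> CREATE TABLE price_predictions (
     id BIGINT,
     predicted_prices_next_90d ARRAY<STRUCT<probability_of_purchase: DECIMAL(10, 4), price: DECIMAL(10, 2)>>,
     ds STRING
   ) USING DELTA;

>> SELECT id,
          -- the prediction for 7 days after ds is simply element 7 of the array
          predicted_prices_next_90d[7].price as predicted_price_7_days_out
   FROM price_predictions
   WHERE ds = '2024-01-01';
One table, one row per id, and pulling the prediction for any given day out is just an index into the array.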
Learning Important Lessons
It’s important to recognize we can all get caught up in the day-to-day cycle of building and shipping pipelines and data projects. We use the same old data types year in and year out.
We forget that these powerful new tools like Delta Lake have a lot of hidden gems right under the surface. We should also recognize that, as Data Engineers, we have the responsibility to slow down and think critically about the things we are building.
It’s important to at the very least be well-versed and knowledgeable with the tools we use. There are a lot of people using Databricks and Delta Lake to build data products at scale.
The number of those people using Array or Map data types is probably so small it’s like Horton Hears a Who.
Just remember, be creative, think about the best way to build out solutions, and consider Map and Array next time.
Have you used these data types before? In what way? Please share your experience below in a comment and help others learn and grow.