An astute reader pointed out recently that I had failed to follow up on Part 1 of Data Modeling. Much to my chagrin, I must return to this turbid and opaque topic of Data Modeling. Part 1 was just a general brain dump on Data Modeling, how to approach the problem, and other high-level thoughts.
While I think the high level is important, knowing the “why” of doing something … I think it is time we delve into the innards of the beast, and see what fancies we can tease out.
Like most things in Data Engineering, Data Modeling is one of the topics full of intrigue, grey lines, and chaos, with no clear direction in sight.
Relational Databases vs Lake House
We can break down Data Modeling into two major areas. Big Data and Not So Big Data. Otherwise known as RDBMS (Postgres, MySQL, SQL Server) vs Lake House (Delta Lake, Iceberg, Hudi, Parquet Files, blah).
There is, of course, bleed over between the two, but, generally speaking, there are fundamental differences that are large enough between the two underlying technologies, that they both really require their own set of methodologies and Data Modeling to be truly effective.
What is the same about RDBMS vs Lake House Data Modeling?
Personally, as someone who spent years designing RDBMS systems, including Data Warehouses at scale in SQL Server, MySQL, and Postgres, and who for these last few years has been building Data Models in everything from parquet files in s3 to Delta Lakes with hundreds of TBs of data … I can say there are many base Data Modeling concepts that apply in both situations.
Ok, so get to it already. What’s the same?
Data Modeling Baseline
Let’s start with some Data Modeling baseline ideas and concepts that will ensure you can tackle a large range of problems. I would also like to insert a comment here that may be controversial.
Most Data Modeling, of all types, boils down to the 80/20 rule. 80% is general Data Modeling that will get you in a good spot; the last 20% is specific to the technology you’re working with (RDBMS vs Lake House).
What are those simple baselines that are the same between RDBMS and Data Lake modeling?
logical normalization
business use case/access patterns
physical data localization
data types
constraints
“grain” of data
The items listed above are concrete steps that have to be taken when you start unraveling the mystery of a new data model. You can’t skip them, no matter if it’s Postgres or Delta Lake.
Let’s see if we can suss out some rubber-meets-the-road Data Modeling ideas from those bullet points. We will eat the elephant one bite at a time.
Logical Normalization.
Normalization, ah, that curse word that has started arguments and stirred up bad blood for years in the old days when Oracle, SQL Server, and Kimball Data Warehouse Toolkit roamed the earth. It was always a tricky and fraught subject.
A part of me is actually sad I don’t hear about this topic much anymore, normalization that is. It’s incredibly important, no matter how deep you jump into the normalization pool. If you’ve been around classic RDBMS databases for any amount of time, you’ve probably at least heard of Third Normal Form. Let’s ask ChatGPT what it thinks normalization is.
“Normalization is a systematic approach to decomposing tables to eliminate data redundancy and undesirable characteristics like Insertion, Update, and Deletion anomalies in a relational database. It is a multi-step process that aims to simplify the data structure while maintaining data consistency and integrity. The main objective of normalization is to divide a large table into smaller (and less redundant) tables and to establish links among these tables.”
Well said my unfeeling AI electrons friend. That sums it up fairly well. If you’re still confused, the below diagram shows you on the left not-very-normalized, and on the right, very-normalized.
Of course, there are official 1-5 normalization levels with their own definitions for those who care about such things. I think it’s important to understand the different levels of normalization if you are not familiar with them, but in the end, I think you should stick to “Logical Normalization.” This applies to both RDBMS databases and Data Lakes in a Delta Table.
Break a data set (or sets) up logically.
Many times files or OLTP systems skew towards “everything together.”
Normalization reduces data redundancy
Think of the problems with duplicate data and updates for example.
Normalization focuses on the use case of a table, reducing complexity.
The normalization process forces understanding of data relationships (keys etc.)
You think about primary, foreign, or composite keys.
In essence, breaking down and bucketing data into logical tables with normalization is a key part of starting to understand your data, which is a grossly under-practiced art.
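To make this concrete, here is a minimal sketch of Logical Normalization in plain SQL (all table and column names are made up for illustration). In Postgres the keys below are enforced; in a Lake House like Delta Lake they are mostly informational, but the logical split of the data is exactly the same.

```sql
-- The "everything together" shape data often arrives in.
CREATE TABLE orders_raw (
    order_id       BIGINT,
    order_date     DATE,
    customer_name  VARCHAR(200),   -- repeated on every order row
    customer_email VARCHAR(200),   -- repeated, painful to update everywhere
    product_name   VARCHAR(200),
    product_price  NUMERIC(10, 2),
    quantity       INT
);

-- Logically normalized: one table per "thing," linked by keys.
CREATE TABLE customers (
    customer_id    BIGINT PRIMARY KEY,
    customer_name  VARCHAR(200),
    customer_email VARCHAR(200)
);

CREATE TABLE products (
    product_id    BIGINT PRIMARY KEY,
    product_name  VARCHAR(200),
    product_price NUMERIC(10, 2)
);

CREATE TABLE orders (
    order_id    BIGINT PRIMARY KEY,
    order_date  DATE,
    customer_id BIGINT REFERENCES customers (customer_id),
    product_id  BIGINT REFERENCES products (product_id),
    quantity    INT
);
```

Updating a customer’s email is now a single-row change instead of a hunt through every order they ever placed.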
Business use case/access patterns.
Very closely related to the last topic of Logical Normalization, is the theory or concept of Business Use Cases or Data Access Patterns. This is typically the point in the process where the “hard” questions are asked.
This is where ambiguity goes to die.
The thing about the Data Modeling process up to this point is that it has all been conjecture and theory rooted in the data lying before you. But, at some point, reality has to set in. We have to ask the question of ourselves, and of our end users … “Please kind person, pray tell, what do you want from this data set?”
This is probably the most important question, and the least often asked, on any Data Team. Silly, kinda.
The answer to that question, when it’s actually put to the end users and the business, is often different than we imagine or think. Why? Because they simply approach life from a non-engineering viewpoint, unlike us, and have a very different idea of what is possible, what they want, and what is of value.
Imagine the real use cases and understanding you can derive from these questions. Of course, they will affect the data model! You may logically normalize your tables in a different manner to support the time-series access problems that your end user said are the whole point of the data set.
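As a hypothetical example, say the business tells you the whole point of the data set is seeing how an order’s status changes over time. That one answer changes the model: a current-state table alone won’t cut it, you need history. A rough sketch (names invented for illustration):

```sql
-- Current state only: answers "what is the status right now?"
CREATE TABLE orders_current (
    order_id   BIGINT PRIMARY KEY,
    status     VARCHAR(50),
    updated_at TIMESTAMP
);

-- The time-series access pattern the business actually asked for:
-- one row per order per status change, a different model entirely.
CREATE TABLE order_status_history (
    order_id   BIGINT,
    status     VARCHAR(50),
    changed_at TIMESTAMP,
    PRIMARY KEY (order_id, changed_at)
);
```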
Physical Data Localization
The next step in the Data Modeling process, be it Data Lake or RDBMS table in Postgres, is the idea of the locality of the data in relation to itself.
It’s a big deal. Have you tried to do some queries on a Postgres table with a few million records and no indexes? How about a Delta Lake query with a few billion rows and no partitions?
Now, some might argue that indexes don’t really co-locate data together; that depends on the RDBMS in use. But it’s the underlying question that really matters. You have a bunch of data that you are modeling, and you found out what the use cases and access patterns are … what next?
You have to think about how to locate data together in a way that supports, not hinders analytics and queries.
Think about indexes in RDBMS.
Think about Partitions and ZOrder in Delta Lake.
Think about scanning tens of thousands of files looking for the needle in the haystack, or needing to gather all data from a certain time period. Anything that reduces complete scans helps. When working on a data model, use every tool your technology gives you, RDBMS or Data Lake, to ensure data is located together whenever possible, both in theory and where the “data meets the disk.”
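Here is a rough sketch of what that looks like on both sides, reusing the hypothetical orders table from earlier. The Postgres index is standard SQL; the Delta Lake statements assume you are running Spark SQL with a recent Delta version, where OPTIMIZE ... ZORDER BY is how you cluster related records inside each partition’s files.

```sql
-- Postgres: an index so date-range queries stop scanning the whole table.
CREATE INDEX idx_orders_order_date ON orders (order_date);

SELECT *
FROM orders
WHERE order_date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31';

-- Delta Lake (Spark SQL): partition on the column you filter by most,
-- then ZORDER to co-locate related records within each partition.
CREATE TABLE orders_delta (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE
)
USING DELTA
PARTITIONED BY (order_date);

OPTIMIZE orders_delta ZORDER BY (customer_id);
```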
Data Types and Constraints
I’m going to lump the next two together for the sake of brevity; this is getting long already. Another basic tenet of Data Modeling, often skipped and left for later, is the thought process around Constraints and Data Types.
But each of them, in its own right, is extremely important to a well-thought-out Data Model. Why? Because Data Types affect things like JOIN speed, lookups, aggregations, and how much downstream processing is required (do you have to cast some STRINGs to INTs all the time?).
Constraints are also critical to the Data Model because they require you to actually know something about the data you are modeling. If you don’t know what the constraints are of the data, you haven’t done your job.
CHECK constraints (say, is this value in this set).
NULL vs NOT NULL
BETWEEN and RANGE
Simple constraints enforce the integrity of the Data Model, that it is valid!
In the same way, Data Types also enforce our view of the Data Model in the real world, and have a big impact on performance. Running a JOIN on 10 million records on columns that are INTEGERS vs STRING will make a difference at scale.
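A minimal sketch of constraints and types working together, with hypothetical columns. Postgres enforces all of this out of the box; Delta Lake supports NOT NULL inline and CHECK constraints via ALTER TABLE, but verify exactly what your engine enforces before you lean on it.

```sql
CREATE TABLE payments (
    payment_id BIGINT         NOT NULL,  -- no unknown ids
    order_id   BIGINT         NOT NULL,  -- integer key keeps the JOIN back to orders cheap
    amount     NUMERIC(12, 2) NOT NULL CHECK (amount >= 0),
    currency   CHAR(3)        NOT NULL CHECK (currency IN ('USD', 'EUR', 'GBP')),
    paid_at    TIMESTAMP      NOT NULL
        CHECK (paid_at >= TIMESTAMP '2015-01-01 00:00:00')  -- range check: nothing before the business existed
);
```

Notice order_id is a BIGINT rather than a string of digits, so nobody downstream has to CAST it on every JOIN.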
Grain of Data
The last, least known, and most important step in Data Modeling is understanding the “grain” of the data. I’ve seen it happen many times: someone spends tons of time, weeks even, developing a large data model of many tables, only to be found out.
The grain of the data was wrong.
Someone tried to answer a question, drill down, and … oops, no can do.
“The grain of the data describes the lowest level of detail that an individual record in a table describes.”
This is critical to a good Data Model. Being able to describe at what level an individual record exists in relation to its counterpart records is very important.
Why?
Because you must know what makes each and every record unique. You also must know how this record relates to others, and most importantly, how you can, or cannot roll up and provide analytics based on this data set.
Agreeing on the level of detail provided by the Data Model might drive, for example, more rollup or summary tables being added to the Model, so that Analytics and the use cases of Analysts can be supported without confusing nuance.
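To put grain in concrete terms, here is a hypothetical pair of tables: the first at the grain of one record per order line, where (order_id, line_number) is what makes each record unique; the second a coarser rollup at one record per customer per day, added so Analysts can answer their questions without line-level nuance.

```sql
-- Grain: one record per line item on an order.
CREATE TABLE order_lines (
    order_id    BIGINT    NOT NULL,
    line_number INT       NOT NULL,
    customer_id BIGINT    NOT NULL,
    product_id  BIGINT    NOT NULL,
    quantity    INT       NOT NULL,
    line_total  NUMERIC(12, 2) NOT NULL,
    ordered_at  TIMESTAMP NOT NULL,
    PRIMARY KEY (order_id, line_number)
);

-- Coarser grain: one record per customer per day, a rollup for Analysts.
CREATE TABLE daily_customer_sales AS
SELECT
    customer_id,
    CAST(ordered_at AS DATE)  AS sale_date,
    COUNT(DISTINCT order_id)  AS order_count,
    SUM(line_total)           AS total_revenue
FROM order_lines
GROUP BY customer_id, CAST(ordered_at AS DATE);
```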
Closing Thoughts
And here I end my tale of Data Modeling basics. It’s a strange world we live in. One day a person can be modeling some tables in Postgres, another day in Delta Lake. Are they the same? How do they differ?
I mean, the underlying technologies between an RDBMS and a Lake House could not be more different. They are complete opposites. Yet, after all these decades, has Data Modeling really changed much? Sure, we have vendors telling us to do this and that, mostly for their own good, not ours.
Having designed many Data Models of all different shapes and sizes, over both technologies, I have found that it’s strikingly similar. Covering the basics as described above is a must.
But, at the same time, each technology does provide its own nuances that must be understood, or chaos will ensue. Delta Lake is not Postgres, and Postgres is not Delta Lake.
As I continue to flesh out this series on Data Modeling, I hope to dive into examples of both RDBMS and Lake Houses, say Postgres and Delta Lake, and examine them each in depth.