If a new engineer started on your team and was assigned a task requiring some work around defining a data spec and the ability to store some messages from another system.
They seemed excited and buoyant, working hard all week on their grand design, bumbling and stumbling through standup meetings, you’re kind, so you let them do their thing.
I mean you want them to learn right? It will be fine.
You ask for an update at the end of the week on how far they’ve gotten so you can give some feedback. And then it happens. They send you an example XML file.
Yes. That’s right. XML.
Your eyes glaze over. You see red. You can’t speak. How could they? XML? What is this 1995?
Later at home, you feel bad. Semi-structured data. You have never given it much thought. You’ve even made fun of JSON. But it gets you thinking. Is there more to semi-structured data than meets the eye?
This blog is brought to you by Firebolt. It’s a fast cloud data warehouse I’m currently exploring for low-latency use cases, and it’s able to deliver sub-second response times over TBs of data.
Data and Software in the Real World.
In reality, modern software and data generally aren't very simple. Getting your data structured in the right format is a critical function in many applications, and your needs will vary depending on the needs of your software and business.
Of course, much of the data stored in Data Warehouses comes from business applications and isn’t very clean, clear, or structured.
Today we are going to talk about semi-structured data.
If you stop and think about it, most data that we all deal with, flying to and fro around our s3 buckets and Python code, probably falls into one of two categories.
structured data (think relational databases)
semi-structured data (think JSON)
If we are honest with each other, compared to structured data, semi-structured data gets a bad rap.
This is for various reasons, some good, some bad, some misguided, some not.
It’s important to note while diving into this topic that we can deal with semi-structured data in two ways.
storage on disk.
data types in memory.
You might think this is obvious, but as we will see later, the lines can get blurred when you store JSON inside Parquet or Postgres.
What is it then? Structured or semi-structured?
What does semi-structured data look like?
At this point we should probably at least try to define what semi-structured data is. I imagine there is some disagreement on the subject, but that’s the spice of life.
Semi-structured data is data that does not map to a rigid structure and schema, like traditional relational databases. Yet, it still contains some properties that make it more accessible to understand and use.
In the real world, this semi-structured data is represented in certain formal data types like JSON or XML, and in memory with structures like dictionaries and maps.
Examples.
For those new to semi-structured data, let’s look at two of the most common formats that are known as the Godfathers of semi-structured data.
JSON
XML
JSON
JSON (JavaScript Object Notation) is probably the most common data interchange format. It has some obvious advantages …
easy for humans to read and write
easy for machines to parse and generate
JSON is most commonly used to send data in web applications (e.g., sending some data from the server to the client, so it can be displayed on a webpage, or vice versa).
Many applications omit logs and data in JSON format.
Here is an example of JSON.
In Python, you can use the json module to read and manipulate JSON data. Here's a basic example:
XML
XML (eXtensible Markup Language) is known as a markup language that is used for encoding files in a format that is reasonably human readable.
It’s common to see XML used for configuration files
If you’ve ever seen raw HTML, XML is similar in the sense that it uses “tags” to define objects and the text in these objects.
But, XML allows anyone to define their own tags, making it a great option for various data representation needs.
Here is an example of an XML file.
Example of reading XML with Python.
XML is widely used in web services, configuration files, and in applications where data needs to be exchanged between different systems. However, JSON is often seen as a good replacement for XML depending on the situation.
More Semi-structured data types?
Well, yes and no. We would be remiss if we skipped two data types that might be considered quasi-semi-structured, in the sense that these data structures are popular tools used to hold semi-structured data themselves.
As well, depending on the underlying technology in which these two data types are implemented, they will many times contain different data types which makes them semi-structured because one can’t rely on a contiguous set of data types. Let’s look closer.
Maps + Arrays
Most folk probably don’t think of Maps and Arrays as semi-structured data, but you should, for several reasons. They are commonly used and in languages like Python, you can mix data types, so the structure is much more “unknown.”
Also, many data storage systems, like Delta Lake, and others, support Maps and Arrays as data types.
Of course, with data types like Map, what was once a single column of known data type can change to be a mix, hence the data becoming only semi-structured.
Semi-structured data in systems like Parquet and Postgres.
Semi-structured data has become so ubiquitous across the data landscape that common file formats like Parquet, used extensively in big data, and databases like Postgres, directly support storing data in these formats.
For example, if you examine the Parquet data type docs you can find support for …
JSON
BJSON
Lists
Maps
Here we use Python’s pyarrow module to write data to a JSON column inside the parquet file format.
It’s extremely common to even run across semi-structured data inside a Postgres database, and we all know there are few things in life more structured than a relational database.
Should I use semi-structured data?
Ok, so we’ve probably beat that concept into the ground. By this point, we should all know what semi-structured data is, and we’ve seen some examples of working with it in code.
“On the surface it’s apparent that working with semi-structured can be more complicated than structured data, to work with.”
So what gives?
If your data isn't flat, it gets complicated. It could be stored in an RDBMS with a technique known as "normalization" - breaking nested data out into separate tables and building relationships between parent and child entities.
It’s common to “flatten” out semi-structured data as well.
The effort might be great enough to justify looking at other options, though, which takes us to the topic of simply learning to work with semi-structured data as a first-class citizen.
More Thoughts on Semi-Structured Data
The reality is that semi-structured data is a fact of life and a balancing act. Especially for those working in the Data world around analytics and Data Warehousing. It’s common to deal with large datasets made up of semi-structured data.
It could be in the form of s3 buckets full of JSON files, or a Delta Table with complex data types like Maps and Arrays. Either way, this can be overwhelming for Data Engineers who’ve not worked much with this type of data.
Do you flatten and remodel the data? Do you find tools that can easily deal with complex data types? Do you find something in-between, keep some data semi-structured, and flatten out other parts?
Positive Aspects of Semi-Structured Datasets
It’s one thing to simply have a pile of semi-structured data sitting in a cloud storage bucket, but what happens when you need to add that data to a Data Warehouse, do JOINs, aggregations, and more?
Semi-structured data for Data Warehouses and Analytics
As previously mentioned, this is where the proverbial rubber meets the road. It’s one thing to learn about semi-structured data, to store it away in cloud buckets, or even embedded in systems like Postgres or Delta Lake, but what do you actually do when you need to use the data?
This is where Data Engineers will have to step into the breach and spend some serious time understanding the usage of this data, and designing data AND compute models that will deliver the results.
The options at a high level are …
Flatten and normalize the semi-structured data.
Leave the semi-structured data as is.
Deal with “unpacking” in memory with compute
Somewhere in between these two options.
We can visualize this in the following drawing.
What types of operations do we have to consider when working with Data Warehouses and analytics … and that may become challenging with semi-structured data?
Indexing
Joins
Aggregations
Filters
The question is … if we are dealing with semi-structured data, say millions of JSON files in an s3 bucket with a highly ragged structure … how do you deal with this?
Small file problem, network, and disk I/O.
How much of the data to flatten vs leave semi-structured
What data points will be used for JOINS or Filters?
What sort of functions may be needed to unpack semi-structured data?
If any one of these problems is not properly thought out and dealt with, the dataset will become unusable for downstream processing.
Pros
What makes semi-structured data so great is its flexibility - it can accommodate a wide variety of data needs, with a broader capacity for the things that make structured data a struggle.
The hierarchy of the data is negotiable, records can be heterogeneous or mixed in nature, and guard rails for data integrity are optional and can be pulled in when it's convenient.
Also, semi-structured data is considered scalable and rich, in the sense that more complex ideas and data can be expressed easily, something that is hard in structured and tabular datasets.
Cons
Ultimately, the disadvantages are varied, depending on which format you move forward with. Some notable examples are given below:
These formats are often tedious to edit by hand, particularly if generated by software that eliminates whitespace for file-size reasons (in particular, XML and JSON suffer from this). In some cases, your entire document might live on a single line of text, which aside from being difficult to read or edit often also has the side effect of bogging down or crashing your text editor.
They may not be the most efficient storage method available, particularly for binary data such as images.
Closing Thoughts
Today we hopefully gave you a gentle introduction to semi-structured data at a high level, what it looks like, the forms it takes, and a basic view of how it’s used in storage systems and in memory.
We also broached the topic of semi-structured data in the world of Data Warehousing and Analytics, talking about some of the challenges it presents in complex operations like Joins, Filters, and Aggregations.
Thinking through these problems with Data Modeling and in-memory unpacking prior to implementations is key.
In the future we hope to dive into more detail and examples around semi-structured data in Data Warehouses and Analytics, how to make data models, and use concepts like lambda functions to efficiently and effectively work with semi-structured data at scale.