Diving into Data Types

For Data Engineers

Feb 12, 2024

This is an interesting one, especially for Data Engineers who’ve spent their life under the thumb of SQL and Python … which is probably most of us. That’s our life.

Keep daydreaming about Rust, ain’t going to happen.

It’s very probable, for most data folk, that they think of Data Types in the context of SQL … say Postgres data types, and if they are heavy Python users, maybe a list or dictionary comes to mind.

Dare I say it, when everything is an object in Python, and you’ve never learned a compiled language … you simply stop caring after a while. That’s ok … most of the time until it’s not.


There are Data Types, and there are Data Types. You know what I mean?

When it comes to Data Engineering, those who spend their days pushing and pulling vast amounts of data over the hill and through the woods … to you know’s house … can get a little weary of the subject.

But I think we also look at Data Types slightly differently than most classic Software Engineers.

They are worried about which structure takes up too much space in memory, and which is the best way to access some thingy in some other listy-array-thingy. Data Engineers worry about the data type of a column in 300TB of Parquet files stored in an S3 bucket.

The same, but not the same.

So what ARE we talking about? Postgres columns? File storage columns in Parquet files? Lists, Dictionaries, Strings in a Python program? Dare I say an array or struct in Rust?

We are talking about all of it.

The Goal

What we want to do today is to simply raise “Data Types” into the minds of those who don’t think about them much. We just want to give you a gentle pat on the head, like your dear old grandmother, give you a cookie and some tea, and say “There, there, now you don’t think it’s wise to completely ignore data types your whole life do you?”

Data At Rest

Let’s start our little journey by talking about the Data Types we use at rest, in storage, and the common mistakes made by Data Engineers who get lazy, old, both, or just don’t know any better.

We could ask why it matters. Why not just store everything as STRING or VARCHAR like most people do? A few reasons:

Disk space matters (data types impact storage size)
Reading data from disk only to convert it to another data type is wasteful and time-consuming.
Correct data types reduce errors and increase data quality.
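To make the first point a little more concrete, here is a minimal sketch using only Python's standard library. It compares the bytes needed to hold the same integers as text versus as fixed-width typed values. The sample values are made up for illustration, and real formats like Parquet layer encoding and compression on top of this, so treat the result as directional only.

```python
from array import array

# Hypothetical sample: 1,000 made-up ride durations in seconds.
durations = list(range(100_000, 101_000))

# Stored as text: one byte per digit, plus a newline between values.
as_text = "\n".join(str(d) for d in durations).encode("utf-8")

# Stored as typed values: array type code "i" is a signed C int.
# Its width is platform-dependent, so we read itemsize instead of assuming.
as_int32 = array("i", durations)
typed_bytes = as_int32.itemsize * len(as_int32)

print(f"text bytes:  {len(as_text)}")
print(f"typed bytes: {typed_bytes}")
```

The gap grows with wider numbers, and the typed version also skips the string-to-number conversion at read time.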

Let’s take Parquet files as an example. Let’s list all the possible data types we could store … and you tell me how many you have actually used.

My guess is not many of them. I mean we could apply the same concept to Postgres and other storage systems as well. Typically they provide a wide range of very specific Data Types.

Yet, most Data Engineers when working on defining schemas or data models simply don’t think very hard about this part.

Questions you should be asking yourself.

Is this really a String?
Do I know the precision of this Decimal?
How big are the Integers in this column?
Should I use ‘Yes’ and ‘No’ or Boolean, 0 or 1?
Are these values actually Timestamps, or just Dates?
Is there a complex Data Type like Map or List that makes sense for this data?
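A couple of these questions can be answered by simply looking at the values. The sketch below is one hypothetical way to do that for the integer question: the helper name `smallest_int_type` and the type labels are assumptions for illustration, not part of any library.

```python
# Inclusive value ranges for common signed integer widths.
INT_RANGES = {
    "Int8": (-2**7, 2**7 - 1),
    "Int16": (-2**15, 2**15 - 1),
    "Int32": (-2**31, 2**31 - 1),
    "Int64": (-2**63, 2**63 - 1),
}

def smallest_int_type(raw_values):
    """Return the narrowest integer type name that fits every value,
    or None if the column is empty or holds a non-integer value."""
    try:
        numbers = [int(v) for v in raw_values]
    except (ValueError, TypeError):
        return None  # not all integers: probably a String, or needs a closer look
    if not numbers:
        return None
    low, high = min(numbers), max(numbers)
    for name, (type_min, type_max) in INT_RANGES.items():
        if type_min <= low and high <= type_max:
            return name
    return None  # too large even for Int64

print(smallest_int_type(["12", "87", "-3"]))       # Int8
print(smallest_int_type(["12", "40000"]))          # Int32
print(smallest_int_type(["12", "TA1307000039"]))   # None
```

Running something like this against a sample of a column before defining a schema is usually enough to avoid the default of "make it a String".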

Next time you are bored on a Friday and cruising around your codebase, look for all the `CAST()` calls in your code … SQL, Spark, whatever. And then ask yourself … why?

Remember, this stuff (data) adds up over time. Let’s prove the point in a simple fashion.

I downloaded a bunch of free data from the Divvy Bike trips data set.

We have about 571 MB of data in flat files (CSV).

Let’s take them and convert them to compressed Parquet files … once as all Strings … and another time with data types correctly converted.

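import polars as pl

def main():
    # Force every column to String so nothing is inferred.
    # Note: newer Polars releases rename the `dtypes` parameter to `schema_overrides`.
    dtypes = {"ride_id": pl.String,
              "rideable_type": pl.String,
              "started_at": pl.String,
              "ended_at": pl.String,
              "start_station_name": pl.String,
              "start_station_id": pl.String,
              "end_station_name": pl.String,
              "end_station_id": pl.String,
              "start_lat": pl.String,
              "start_lng": pl.String,
              "end_lat": pl.String,
              "end_lng": pl.String,
              "member_casual": pl.String}
    # Read every CSV in the working directory into one DataFrame.
    data = pl.read_csv("*.csv", dtypes = dtypes)
    # Write a single Parquet file (assumes the pickles/ directory already exists).
    data.write_parquet("pickles/data.parquet")

if __name__ == "__main__":
    main()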

That gives us about 151MB

What happens if we apply the correct data types?

correct_dtypes = {"ride_id": pl.String,
             "rideable_type": pl.String,
             "started_at": pl.Datetime,
             "ended_at": pl.Datetime,
             "start_station_name": pl.String,
             "start_station_id": pl.String,
             "end_station_name": pl.String,
             "end_station_id": pl.String,
             "start_lat": pl.Float64,
             "start_lng": pl.Float64,
             "end_lat": pl.Float64,
             "end_lng": pl.Float64,
             "member_casual": pl.String}

We get a slight storage savings, down to 151.3MB, about 1%, so not a ton. But we also get more accurate data! Having the right data types when data is “at rest” or “on disk” makes it easier when we transition the data into memory and reduces the antics we later have to deal with.

Sometimes diving into Data Types simply means actually looking at your data and applying the correct Data Type.

Data In Memory and Code

I suppose at this point we could move on to the slightly more alluring topic of Data Types in memory … aka our code. This can be a harder topic to tackle, simply because of the sheer size and breadth of what we could cover.

There are lots of opinions and tears that can be shed on this subject, but let’s simply try to give a very basic overview of Data Types in memory, say with Python and Rust.

We could all use a refresher at some point, no?

Non-Python Data Types for Python Folk

My gut says you are coming here with at least some Python under your belt. Maybe that's your entire shtick. On the off-chance that you haven't witnessed what other languages bring to the table, I'd like to briefly talk about a few of them.

First, there's this little guy called C. It's kind of famous, especially if you've ever used a computer before. It sets the bar for many languages in use today, so it's a great starting point.

struct Thing {
    int a;
    int b;
    char c;
};

There it is. A well-defined data type with three fields, each of which has a known size in memory to the compiler. It serves only one purpose - to codify the structure of a Thing; no trickery regarding field visibility, member functions, polymorphism, etc.

Well, since C came along, we've had other languages take a stab at things. C++ lets you use that same exact struct in your code with no alteration and it just works, but they couldn't leave well enough alone; they also have all that baggage I mentioned just a moment ago:

Here, we included functions, started playing with visibility, and included the concept of inheritance. Also, as mentioned in the comment, we could have omitted the b = 0; statement, since it automatically initializes to 0.

Some Rust

Then, there's Rust. It looks at the above, and says there's a "better" way. A Rust struct has all the same moving parts, but is a lot more formal about how things work:

Where Rust really differentiates itself from C++ is in the separation of concerns - the shape of the struct is independent of its behavior.

Also, unlike C++, there is no concept of inheritance; ThirdThing can't simply bolt onto another existing struct definition to get free fields, taking more of a cue from the earlier C example.

...And, if you don't like having to spell out defaults in your constructor (e.g. the b: 0 assignment shown above), Rust allows you to bolt on some built-in behavior using its Default derivation:

And how about Python?

Traditionally, Python was very relaxed about defining structures. Or, rather, you didn't; you simply threw things into one of its existing constructs and made do.

For instance, if you wanted an object with fields a and b as with our other examples, you'd probably reach for a dictionary:

thing = {'a': 5, 'b': 0}

Maybe, if you're a Python lifer, you see this and think this is an argument in favor of Python over the cruft and ceremony of the other languages. And, you might be right, depending on your needs.

However, if the goal is to have a consistent and inviolable definition for what a "thing" needs to look like, you have no guardrails here, and bad data can derail your program.
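To see what "no guardrails" looks like in practice, here is a small sketch. The function `total` and the malformed dictionaries are made up for illustration; the point is that Python happily creates all of them, and the problem only shows up later when the data is used.

```python
# The "expected" shape of a thing: integer fields a and b.
good_thing = {"a": 5, "b": 0}

# Nothing stops these from being created; Python accepts all of them.
wrong_type = {"a": "five", "b": 0}   # a is a string, not an int
missing_field = {"a": 5}             # b is absent
typo_field = {"a": 5, "bb": 0}       # misspelled key

def total(thing):
    """Add the two fields, assuming both exist and are numbers."""
    return thing["a"] + thing["b"]

print(total(good_thing))  # 5

for bad in (wrong_type, missing_field, typo_field):
    try:
        total(bad)
    except (TypeError, KeyError) as exc:
        # The failure surfaces at the point of use, not at creation.
        print(f"{type(exc).__name__}: {exc}")
```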

If that were the end of the story, I'd simply tell you to jump ship and switch to Rust. Thankfully, it's not game over for Python, thanks to classes.

class Thing:
    def __init__(self, a):
        self.a = a
        self.b = 0

Well, it's self-contained, at least, but compared to the other languages, it's still very sketchy - aside from the initialization of field b, there's nothing type-related here to tell us what a Thing actually is.

We'll need to dig just a bit deeper into the Python ecosystem to get what we need:

Dataclasses

Since version 3.7, Python has offered dataclasses as a way of defining the shape of a data structure. They depend on type annotations (as seen in PEP 526) to wire up some default behaviors for us, and given those type annotations we also have a well-represented schema for the shape of our data.

Employing this technique, our Thing could be better represented as shown below:

from dataclasses import dataclass

@dataclass
class Thing:
    a: int
    b: int = 0

The @dataclass decorator, by default, brings a lot to the table. One of those things is an auto-generated __init__ function that takes a as a required parameter and b as an optional one defaulting to 0, giving us feature parity with the previous example (and a little more, since the earlier class always set b to 0). Consider the following script (building off of our current definition):

from dataclasses import dataclass

@dataclass
class Thing:
    a: int
    b: int = 0

thing = Thing(3)

print(thing)

# Output: Thing(a=3, b=0)

We got a __repr__ implementation for free, allowing us to print thing directly and get readable output. In fact, we get a lot of really cool things by default, which are described in detail in the earlier-linked Python docs page.

Unfortunately, we don't get any runtime constraints. Nothing stops you from blowing out one of the fields with invalid content; for instance, after instantiating Thing(3), we could follow up with a statement like thing.a = 'x' and get away with it, defeating the purpose of declaring the member types.

This can be compensated for, for instance by using properties and hidden members, but let's be honest and admit that it's heavy-handed and we're asking Python for something it wasn't originally designed to handle.
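For the curious, this is roughly what the properties-and-hidden-members approach looks like. It is a minimal sketch, not a recommendation: only field a is guarded, and the error message wording is made up.

```python
class Thing:
    """A Thing whose field a is type-checked on every assignment."""

    def __init__(self, a: int, b: int = 0):
        self.a = a  # goes through the property setter below
        self.b = b  # left unchecked to keep the sketch short

    @property
    def a(self) -> int:
        return self._a  # the "hidden" member holding the real value

    @a.setter
    def a(self, value: int) -> None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError(f"a must be an int, got {type(value).__name__}")
        self._a = value

thing = Thing(3)
thing.a = 7          # fine
try:
    thing.a = "x"    # rejected at runtime
except TypeError as exc:
    print(exc)       # a must be an int, got str
print(thing.a)       # 7
```

That is a lot of ceremony for one field, which is rather the point being made here.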

There are better languages out there for this kind of job, and trying to protect developers from themselves in one of the most flexible languages available is perhaps a bad use of your time.

Bringing value

So, we talked a bit about what data structures are, and gave a few trivial examples in different languages. For a data engineer, the primary benefit of typed structures is that they establish a clearly defined domain to work within. We're modeling important things, and we want to know what those things are when working with them.

First, we're probably dealing with collections of things.

Frequently, these collections are homogeneous lists, wherein the items in the list are of the same type. SQL tables, for instance, are consistent and strongly-typed collections that let us operate in batch rather than forcing us to deal with field consistency at the record level.

In a language like Rust, it's very easy to make it clear in code what we're up to:

struct Person {
    pub last_name: String,
    pub first_name: String,
    pub middle_name: Option<String>,
}
type People = Vec<Person>;

Now, any time we see an instance of People, we know with certainty that we're dealing with a list of entities that consistently have a last and first name, and may or may not have a middle name. You can easily project those columns because you know with confidence that they're present and are all of a consistent type.
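For the Python folk, a rough equivalent can be sketched with a dataclass and a type alias. Keep in mind that Python does not enforce these annotations at runtime; they document intent and let a static type checker do the policing. The sample names and the `last_names` helper are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Person:
    last_name: str
    first_name: str
    middle_name: Optional[str] = None  # may or may not be present

# A type alias, mirroring the Rust `type People = Vec<Person>`.
People = List[Person]

def last_names(people: People) -> List[str]:
    """Project one "column" out of the collection."""
    return [person.last_name for person in people]

people: People = [
    Person("Doe", "Jane"),
    Person("Roe", "Richard", "Quincy"),
]
print(last_names(people))  # ['Doe', 'Roe']
```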

What if your list isn't homogeneous? If the possible types are well-understood and of a reasonable quantity, Rust can still help you out with its algebraic data types:

struct Person {
    pub last_name: String,
    pub first_name: String,
    pub middle_name: Option<String>,
}
struct Business {
    pub name: String,
    pub doing_business_as: Option<String>,
}

// Our entity could be either a Person or a Business:
enum Entity {
    Person(Person),
    Business(Business)
}
type Entities = Vec<Entity>;

Here, we're dealing with lists of items that could be either a Person or a Business, and the language forces us to manage both any time we're dealing with the collection:

// Get display names for our collection of entities
fn get_display_names(entities: Entities) -> Vec<String> {
    entities
        .into_iter()
        
        // Match statement requires comprehensive coverage of all arms:
        .map(|entity| match entity {
            Entity::Person(person) => format!("{}, {}", person.last_name, person.first_name),
            Entity::Business(business) => business.name.clone(),
        })
        .collect()
}

Above, you can see that it takes a bit of work to tease the two entity types apart, but because our language has the details of our data structure available, it can let us know when we've neglected to handle something.

For instance, if we add a new type to our Entity enum, but make no changes to the get_display_names function, we'll get screamed at by the compiler:

enum Entity {
    Person(Person),
    Business(Business),
    Other(String),
}

// Compiler output:
// .map(|entity| match entity {
//                     ^^^^^^ pattern `Entity::Other(_)` not covered
// non-exhaustive patterns: `Entity::Other(_)` not covered
// help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown

With such a degree of type safety in place, we can fearlessly model our data in code, and know that the compiler is keeping track of many ways we might botch handling it.
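Python has no compiler to do this for us, but the same idea can be approximated. The sketch below models the entity as a Union of two dataclasses and uses a small `assert_never` helper, defined locally here as a stand-in, so that a static type checker such as mypy can flag an unhandled variant, and so that the program fails loudly at runtime if one slips through. This is one possible pattern under those assumptions, not a like-for-like replacement for Rust's match.

```python
from dataclasses import dataclass
from typing import List, NoReturn, Optional, Union

@dataclass
class Person:
    last_name: str
    first_name: str
    middle_name: Optional[str] = None

@dataclass
class Business:
    name: str
    doing_business_as: Optional[str] = None

# Our entity could be either a Person or a Business.
Entity = Union[Person, Business]

def assert_never(value: NoReturn) -> NoReturn:
    """Local helper: a type checker complains if this call is reachable;
    at runtime it raises if an unhandled type gets here."""
    raise AssertionError(f"unhandled entity type: {type(value).__name__}")

def get_display_names(entities: List[Entity]) -> List[str]:
    """Get display names for our collection of entities."""
    names = []
    for entity in entities:
        if isinstance(entity, Person):
            names.append(f"{entity.last_name}, {entity.first_name}")
        elif isinstance(entity, Business):
            names.append(entity.name)
        else:
            # Reached only if Entity grows a variant we forgot to handle.
            assert_never(entity)
    return names

print(get_display_names([Person("Doe", "Jane"), Business("Acme")]))
# ['Doe, Jane', 'Acme']
```

The difference from Rust is where the check lives: it only helps if a type checker is actually run as part of the workflow.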

Wrapping It Up

Yikes, we ran through the whole world there, didn’t we? Covered a lot of stuff. I can’t even remember where we started, my brain hurts. Oh yes, Data Types “at rest”.

Data Types are one of those topics we either talk around without actually talking about, or simply ignore and don’t think about much.

We reach for the stuff we get used to; we are creatures of habit. Everything is just a String in the database. Or throw a bunch of stuff in a list in Python. Who cares? Or should we?

I will let you be the judge of that.