It’s one of those subjects that’s sort of taken for granted, working with cloud storage like s3, that is. Yet, it’s probably one of the most common and, many times, one of the first tasks newly minted Data Engineers, fresh off the assembly line, work on during their first year of agony and confusion.
Files in s3. It seems like such a simple topic, yet it’s such a fundamental piece of pretty much all Data Platforms. You would think we would be experts by now.
That’s the plan for today. To explore the wind-swept and ravaged shores of s3 buckets with seas of data. With code. With the CLI. Forward and onward!
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Check out their website below.
How Data Engineers do s3.
We are going to explore cloud storage from the viewpoint of s3, the most ubiquitous storage of them all. Amazon Simple Storage Service (S3) is a scalable object storage service provided by AWS (Amazon Web Services).
The simple idea of storage, with something like s3, has changed a lot since the days of yore, turning what once used to be a boring part of a Data Engineer's job … into one of the most important. Things aren't simple anymore; in fact, they can be very complex.
In a strange turn of events, it almost becomes impossible to cover all aspects of the topic of “cloud storage” in even a few articles. We almost have to pick and choose.
Think about that little graph above listing just a few things Data Engineers do when it comes to cloud storage. Yikes.
To keep things down to earth and our feet in the mud, I think we will cover, at a high level, the main points of cloud storage, how we should think about the topic, and from there move into actual examples using CLIs and code to interact with s3.
This approach may jog the memory of some, introduce a new idea to others, spring up new ideas, and hopefully be all-around helpful. Let us begin this clouded journey and see where it goes.
In the context of Data Engineering, s3 plays a critical role in several areas:
Data Storage:
Raw Data Lakes: S3 is frequently used to store vast amounts of raw data, which can be processed and transformed as needed. This allows businesses to decouple storage from compute.
Processed Data Storage: Once data is processed, transformed, or aggregated, the results can be stored back into S3 for further analysis or consumption.
In summary, s3 is the garbage heap of our digital selves. We dump, dump again, and dump more. It’s a habit, it’s easy to do, so we do it.
This habit of dumping and not taking s3 or cloud storage seriously is the first pitfall many Data Engineers fall into. And let me tell you, it's a hard one to back out of once things are a mess.
You should apply rigor and thought to cloud storage in your Data Stack. Apply Engineering best practices to everything you do. For example:
Be consistent with the naming conventions of s3 buckets.
Understand the configurations of s3 buckets.
Be consistent with directory structures.
Understand what data partitioning is, and how to apply it in an s3 bucket.
Understand how data compression affects the costs of s3 storage. Be wise, compress.
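To make those points concrete, here is a purely hypothetical sketch of a consistent, partitioned bucket layout; the bucket name, dataset names, and partition keys are all made up:

```
s3://acme-data-lake-prod/raw/orders/ingest_date=2023-10-01/part-0000.json.gz
s3://acme-data-lake-prod/raw/orders/ingest_date=2023-10-02/part-0000.json.gz
s3://acme-data-lake-prod/processed/orders_daily/year=2023/month=10/day=02/part-0000.parquet
```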
When thinking about cloud storage systems like s3, we also can't forget they provide benefits we rarely think about, mostly because they work so well that we take them for granted.
Scalability & Durability:
Highly Scalable: You can store any amount of data in S3, from a few bytes to petabytes, and it can handle large concurrent workloads.
Data Durability: Amazon S3 is designed for 99.999999999% (11 9's) of durability. It replicates data across multiple Availability Zones within a given AWS region.
And that’s not all. One thing we think about a lot in Data Engineering is interfaces and how systems work together. A lot of Data Engineering pipelines and environments are made up of multiple tools working and talking together.
The nice thing about cloud storage, like s3, is that many tools offer out-of-the-box support. It just makes life easier.
Integration with Data Processing Systems:
Big Data Frameworks: S3 easily integrates with big data frameworks like Apache Spark, Apache Hadoop, and Presto.
AWS Native Services: AWS services like Amazon Athena, Amazon Redshift Spectrum, and AWS Glue can directly interact with data stored in S3.
Event-driven Processing:
S3 Event Notifications: You can configure S3 buckets to send notifications (like Lambda function triggers) when specific events (such as PUT, POST, or DELETE) occur, facilitating real-time data processing workflows.
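For example, here is a rough sketch of wiring a bucket up to a Lambda function with the CLI. The bucket name, function ARN, and file name are all hypothetical:

```bash
# notification.json would contain something like (hypothetical ARN):
# {
#   "LambdaFunctionConfigurations": [
#     {
#       "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-new-files",
#       "Events": ["s3:ObjectCreated:*"]
#     }
#   ]
# }

aws s3api put-bucket-notification-configuration \
  --bucket my-data-bucket \
  --notification-configuration file://notification.json
```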
s3 - code and CLI.
In the end, the life of an average Data Engineer ends up being code. Code, code, and more code. Working with cloud storage like s3 is no different.
There are endless ways to interact with data in the cloud, so let's examine some of the ways we can directly interact with files. I'm going to approach it from the perspective of s3 and narrow it down for us.
boto3 with Python (code package/API).
Using code or bash to call the CLI.
Via a tool like Spark (we will leave this one for later).
My personal interactions with cloud storage like s3 usually come from within a tool like PySpark or PyArrow. But, of course, there are many times when we are munging around raw data to and from s3 buckets.
Looking for files, getting files, filtering files, putting files. Always something. Honestly what I want to do is just introduce a variety of common tasks.
How to do common s3 tasks as a Data Engineer.
This isn't supposed to be an exhaustive list, just things that I've done many times over during my career, and I'm assuming you have too, or if not yet, you probably will soon.
I will give you the “what,” like maybe what this would accomplish and why, and then the “how.”
Without further ado and in no particular order.
Sync two locations, one of them being s3, and delete anything in the target that doesn't exist in the source.
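Something like this should get it done, assuming a hypothetical local directory and bucket:

```bash
# Sync a local directory up to s3, removing anything under the prefix
# that no longer exists locally (paths and bucket are hypothetical).
aws s3 sync /local/data/ s3://my-data-bucket/data/ --delete
```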
Sync two S3 buckets while excluding all .parquet files.
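Roughly, with made-up bucket names:

```bash
# Copy everything from one bucket to another, skipping parquet files.
aws s3 sync s3://source-bucket s3://target-bucket --exclude "*.parquet"
```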
Sync two S3 buckets, excluding all .parquet files and only including files that have "2023" in their path or name.
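One way to sketch that out (bucket names are hypothetical) is to exclude everything, include the 2023 files, and then re-exclude the parquet ones, since the CLI applies the filters in order:

```bash
# Only keys containing "2023", minus any parquet files.
aws s3 sync s3://source-bucket s3://target-bucket \
  --exclude "*" --include "*2023*" --exclude "*.parquet"
```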
Copy all .gz files from a local folder (including its subdirectories) to an S3 bucket.
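A rough sketch, again with hypothetical paths:

```bash
# Recursively copy only the gzip files from a local folder up to s3.
aws s3 cp /local/folder/ s3://my-data-bucket/gz-files/ \
  --recursive --exclude "*" --include "*.gz"
```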
Of course, you could also list the contents of buckets with the CLI using the same combination of --exclude, --include, --recursive, or whatever. It's the little things that make a difference, and learning the nuances of searching an s3 bucket comes in handy.
Also, it’s important to note that more complex workflows with the AWS CLI tool can be written with bash. Being able to have bash scripts that do certain common tasks, take input, and that can make certain small logic switches … well … that’s very powerful.
For example …
For every bucket in the AWS account, if the bucket name contains the word "backup", sync all .log files from a local directory (/local/logs/) to a folder named logs inside that bucket.
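Here is one rough way to sketch that out in bash, assuming the AWS CLI is already configured and the paths above are hypothetical:

```bash
#!/bin/bash

# Loop over every bucket name in the account.
for bucket in $(aws s3api list-buckets --query "Buckets[].Name" --output text); do
  # Only touch buckets with "backup" in the name.
  if [[ "$bucket" == *backup* ]]; then
    echo "Syncing logs to s3://$bucket/logs/"
    aws s3 sync /local/logs/ "s3://$bucket/logs/" \
      --exclude "*" --include "*.log"
  fi
done
```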
The sky is the limit, along with your imagination.
Code + s3 (boto3).
This is one tool I've been using most consistently over the years as a Data Engineer. At some point, we are typically working on a pipeline or project that requires more fine-grained work with cloud files.
If you’re using Python in an AWS environment then boto3 is the most logical choice to do that work. It’s truly amazing the things you can do with boto3. It’s impossible to cover all the features of boto3+s3, but we can give a quick overview.
Before you can start using boto3 with S3, you need to set up authentication credentials. These can be set in a configuration file or in the environment variables.
Create an S3 client or resource instance.
Create or list buckets.
Upload, download, delete, or list files.
Directly read or write from S3 objects using Python's file-like interface.
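Here is a minimal sketch that touches most of those points, assuming credentials are already configured (via ~/.aws/credentials or environment variables) and that the bucket and file names are purely hypothetical:

```python
import boto3

# Create an S3 client (credentials are picked up from the environment).
s3 = boto3.client("s3")

# List the buckets in the account.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Upload a local file, then download it back somewhere else.
s3.upload_file("local_data.csv", "my-data-bucket", "raw/local_data.csv")
s3.download_file("my-data-bucket", "raw/local_data.csv", "copy_of_data.csv")

# Read an object's contents directly, without saving it to disk first.
obj = s3.get_object(Bucket="my-data-bucket", Key="raw/local_data.csv")
print(obj["Body"].read()[:100])
```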
I mean think about it … boto3 combined with Python logic can do many strange and wonderful things. Like getting the latest file from an s3 bucket after paging through results.
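A rough sketch of that idea, paging through a hypothetical bucket and prefix to find the most recently modified object:

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Walk every page of results under the prefix, keeping the newest object.
latest = None
for page in paginator.paginate(Bucket="my-data-bucket", Prefix="raw/"):
    for obj in page.get("Contents", []):
        if latest is None or obj["LastModified"] > latest["LastModified"]:
            latest = obj

if latest:
    print(f"Latest file: {latest['Key']} ({latest['LastModified']})")
```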
Again, the sky is the limit with boto3, it’s a powerful tool. Being able to combine Python and s3 into a single script, you can pretty much do anything your mind can dream up.
I feel like even after blowing all this hot air we have barely scratched the surface of cloud storage for Data Engineers. That means we should probably do a mini-series on this and dive deeper into some of these more complex topics and code use cases.
What it really boils down to.
Take your cloud storage seriously. Use CLIs, they are powerful when combined with bash. Never forget the power of Python+s3=boto3.
Don’t treat your storage buckets like the junk drawer.