Vendor Lock-In and “Closed” Technology
S3 Tables are not a Lake House, nor are they commodity file storage on which you can build your Lake House. They are an Amazon table lock-in.
Let me explain:
We live in a Lake House world, don’t we? It’s fair to say that the death of the Data Warehouse happened slowly over time, but that the rise of open table format technologies like Delta Lake, Iceberg, and Hudi put the final nail in the proverbial coffin.
I know we all wish we lived in a hand-hewn cabin in the woods and drank our water from a clear bubbling stream, but alas, here we all are, stuck in Data Land, and one of the hottest topics today is AWS S3 Tables, which were recently announced at re:Invent.
As soon as that new shiny toy was released, I did a deep dive, mostly technical, into the how and what of this addition to the data landscape. You can read about me setting up a brand new AWS S3 Table from scratch here.
I will give you the bullet points from that initial poking around at S3 Tables …
- AWS S3 Tables are made to be used exclusively with AWS products like Glue, Athena, EMR, etc.
- They are not made to be “open” in any sense of the word.
- S3 Tables lack general query-engine and interaction support outside Apache Spark (and heavily favor EMR); even Spark access goes through an Amazon-specific catalog, as the sketch after this list shows.
- They are a bit of a black box.
- Don’t be tricked by the “we used the Apache Iceberg format” comments.
- Because of “auto-maintenance” and other factors, S3 Tables are a VERY expensive way to attempt to build a Lake House.
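To ground the Spark point above, here is a minimal PySpark sketch of what connecting to S3 Tables actually looks like. Treat it as an illustration under assumptions rather than a setup guide: it presumes the AWS-published s3-tables-catalog-for-iceberg JAR is on the classpath, and the bucket ARN, catalog name, and table names are placeholders.

```python
# Minimal PySpark sketch of reading/writing S3 Tables (illustrative only;
# the table bucket ARN and names are placeholders, and the AWS-published
# s3-tables-catalog-for-iceberg JAR is assumed to be on the classpath).
from pyspark.sql import SparkSession

TABLE_BUCKET_ARN = "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket"

spark = (
    SparkSession.builder.appName("s3-tables-sketch")
    # Iceberg's generic Spark catalog wrapper...
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog")
    # ...backed by Amazon's own catalog implementation, not a plain
    # Iceberg REST, Glue, or Hive catalog. This is the lock-in point.
    .config(
        "spark.sql.catalog.s3tables.catalog-impl",
        "software.amazon.s3tables.iceberg.S3TablesCatalog",
    )
    # The "warehouse" is a table bucket ARN, not an s3:// path you control.
    .config("spark.sql.catalog.s3tables.warehouse", TABLE_BUCKET_ARN)
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tables.demo")
spark.sql(
    "CREATE TABLE IF NOT EXISTS s3tables.demo.events (id BIGINT, ts TIMESTAMP) USING iceberg"
)
```

Notice that even the “we use Apache Iceberg” path resolves through software.amazon.s3tables.* classes rather than a standard Iceberg catalog.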
Lest you think I have an axe to grind, I do not. I examine tools from both an Engineering AND a high-level Architecture viewpoint. I ask myself, “Could one, and SHOULD one, adopt this tool and build a Lake House on top of it?”
I’m going to save you some trouble and say No.
Why, you ask? Because they clearly lack the technical depth and support for general-purpose data engineering that building a Lake House requires. Go back and read my technical review, or play around with them yourself; you will come to the same conclusion.
There is also another, more philosophical and arguably more important, reason.
The Lake House is distinguishable from a proprietary Data Warehouse because the Lake House is open. It is built on open storage, and because that storage is open, you can choose which query engine to use. Crucially, the Lake House requires both an open storage system and open table formats. Without both, you are locked into a proprietary platform.
Take a minute to think about that carefully.
Part of our job as Data Platform builders and engineers is to see the big picture. We are critical of the tools and systems we use and build on top of.
The Modern Data Stack by definition is made up of a variety of tools that can seamlessly talk and work together to move, store, and transform data as needed.
Amazon S3 Tables aren’t S3 or a data lake; they are, in fact, a proprietary table API for reading and writing tables, not for reading and writing files.
Amazon S3 Tables are designed to be accessed through AWS services. Anyone else wanting to integrate with them must adopt the proprietary API through a 3rd-party connector (here). They are not following open source Apache Iceberg standards.
Consequently, there is no support for Databricks, Snowflake, Starburst, …, and I doubt there ever will be.
But what about their built-in maintenance and performance claims? Nonsense! The maintenance is nothing more than the same basic, unintelligent automatic compaction and file deletion/clean-up they already offer with the Glue Catalog. You can turn it off, but you can’t swap in anything better… because Amazon S3 Tables are not open to other engines.
As for the performance, there is nothing here that you can’t get, and more, from other engines on OSS Iceberg + S3. Everyone can 10x their S3 performance with prefixes (see the S3 User Guide), as the sketch below illustrates.
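To make the prefix point concrete, here is a toy sketch (my own illustration, not AWS code). S3 scales request throughput per key prefix (roughly 3,500 writes and 5,500 reads per second per prefix), so fanning objects out across hashed prefixes multiplies that ceiling.

```python
# Toy illustration of S3 prefix fan-out (my example; key layout made up).
# S3's request limits apply per key prefix, so spreading objects across N
# prefixes raises the aggregate throughput ceiling roughly N-fold.
import hashlib

def prefixed_key(key: str, num_prefixes: int = 16) -> str:
    """Prepend a stable hash shard so requests spread across N prefixes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % num_prefixes
    return f"{shard:02x}/{key}"

print(prefixed_key("events/2024/12/03/part-0001.parquet"))
# -> e.g. "0a/events/2024/12/03/part-0001.parquet"
```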
Look, I am all for competition in the Lake House market. May the best open Lake House win.
My ask to all the vendors is: Keep to the Lake House architecture, keep the storage open!
Don’t lock me into your proprietary tables!
I don’t want to go back to data warehouse lock-ins and price hikes!
I want the Lake House to stay open and competitive.
> They are not following open source Apache Iceberg standards.
I'm not sure Iceberg follows its own "standards". To this day, the only fully-featured "standard" in Iceberg is the Java implementation for Spark. So?
I really hope that AWS, being a behemoth, will at least standardize the "standard" Iceberg.