5 Comments

Did you try this at scale? What about insert/merge performance? I'm curious about the differences as well, because I'm of the same mind: a measure of control over your partitions is usually preferable to letting a machine figure it out dynamically all the time.

(Also, it's cheaper, since we don't have to pay to constantly optimize our data.)
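
For what it's worth, here's roughly how one could measure it; a minimal sketch, assuming two hypothetical copies of the table (`events_partitioned`, partitioned by event_date, and `events_clustered`, liquid clustered on the same column) plus a staging batch `events_updates`:

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical setup: events_partitioned (partitioned by event_date)
# and events_clustered (liquid clustered on event_date) share a schema;
# events_updates holds the incoming batch to upsert.
MERGE_SQL = """
    MERGE INTO {target} AS t
    USING events_updates AS s
      ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
"""

for target in ("events_partitioned", "events_clustered"):
    start = time.time()
    spark.sql(MERGE_SQL.format(target=target))
    print(f"{target}: merge completed in {time.time() - start:.1f}s")
```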

While your test is arbitrary and not necessarily demonstrative of a real-world use case and performance comparison, I think if the scale is indicative of anything, I could argue that nobody queries "the whole table" like this; in fact, they query very specific time ranges of interest (e.g. the last 7/14/30 days). If one were to argue that real-world query volumes are 10% or less of the total volume, a linear performance profile would put both the manually partitioned and liquid clustered results in negligible territory (i.e. 8 seconds × 10% = 0.8 seconds and 4 seconds × 10% = 0.4 seconds).
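
To make that concrete, this is the shape of query I have in mind; with a date partition key or a date clustering key, either layout should scan only the files for the requested window (table and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partition pruning (manual partitioning) or file skipping via
# clustering statistics (liquid clustering) should both restrict the
# scan to roughly the last 30 days of files, not the whole table.
recent = spark.sql("""
    SELECT count(*)
    FROM events
    WHERE event_date >= date_sub(current_date(), 30)
""")
recent.show()
```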

One kind of value you get from data organized under your own partition scheme is that you can manage archival and migration tasks more cheaply and effectively, and with cheaper tools/technologies (Azure Functions vs. Synapse Serverless SQL vs. Presto vs. Databricks)!
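
For example (table name and cutoff made up): shipping a cold slice out of a date-partitioned Delta table is just a copy plus a partition-aligned delete, which even a lightweight tool can drive:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
cutoff = "2023-01-01"  # hypothetical archival threshold

# Copy the cold slice into a cheap archive table, then remove it from
# the hot table. Because event_date is a partition column, the DELETE
# is file-aligned: whole files are dropped from the transaction log
# rather than rewritten.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS events_archive
    AS SELECT * FROM events WHERE event_date < '{cutoff}'
""")
spark.sql(f"DELETE FROM events WHERE event_date < '{cutoff}'")
```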

I work in an ML-heavy environment, and we do daily massive table reads and computations on tens of billions of records spanning very large timeframes.

I expect liquid clustering is simply the next generation of partitioning that we will all be using in short order.

In your environment, do you archive data at all? I guess the impact on archiving will just land in your test criteria, to validate against some of the concerns you've already pointed out ("what if a one-off query changes the partitioning, thereby creating negative consequences?").

Side questions:

1: Since your environment is ML-heavy, the need for past state is likely largely fulfilled by Delta versioning. How long do you keep Delta versions for? (See the retention sketch after these questions.)

2: Do you have to deal with the need for fully auditable state change or longitudinal analysis (i.e. SCD Type 2)? How do you provide this? (If needed, I assume it is probably a different dataset; see the merge sketch below.)
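
To be concrete about question 1, by "keeping Delta versions" I mean the window governed by the table's retention properties and by how aggressively VACUUM is run; a sketch with made-up values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Time travel / DESCRIBE HISTORY can only reach back as far as these
# retention settings (and any VACUUM runs) allow.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 30 days'
    )
""")

# Physically remove data files no longer referenced by any version
# inside the 30-day window (720 hours).
spark.sql("VACUUM events RETAIN 720 HOURS")
```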
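
And for question 2, by type-2 I mean roughly this pattern, usually maintained as its own table via MERGE (all names, including the attributes_hash change-detection column, are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: close out the current row for any key whose attributes
# changed. customers_scd2 carries valid_from / valid_to / is_current;
# customers_updates is the incoming batch.
spark.sql("""
    MERGE INTO customers_scd2 AS t
    USING customers_updates AS s
      ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.attributes_hash <> s.attributes_hash THEN
      UPDATE SET valid_to = current_timestamp(), is_current = false
""")

# Step 2: insert a fresh current row for every key that is new or was
# just closed out (i.e. has no current row with a matching hash).
spark.sql("""
    INSERT INTO customers_scd2
    SELECT s.*,
           current_timestamp()      AS valid_from,
           CAST(NULL AS TIMESTAMP)  AS valid_to,
           true                     AS is_current
    FROM customers_updates s
    LEFT ANTI JOIN customers_scd2 t
      ON t.customer_id = s.customer_id
         AND t.is_current = true
         AND t.attributes_hash = s.attributes_hash
""")
```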

I might have missed it in the article, but why are we optimizing the liquid clustered table after every run?

I simply did it because, as far as I can decipher, statistics are gathered on columns, OPTIMIZE is non-deterministic, and I can only assume you have to run OPTIMIZE at least daily for the "liquid" in liquid clustering to take effect.
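
Concretely, the pattern I mean (table name is just an example): the clustering keys are declared on the table, but newly written files are only incrementally reclustered when OPTIMIZE runs, hence running it after every load:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Liquid clustering is declared once on the table...
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id   BIGINT,
        event_date DATE,
        payload    STRING
    ) CLUSTER BY (event_date)
""")

# ...but data written since the last OPTIMIZE sits unclustered until
# OPTIMIZE runs again, so it goes at the end of each load.
spark.sql("OPTIMIZE events")
```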
