7 Comments
Fahim Kanji:

Did you try this at scale? What about insert/merge performance? I'm curious about the differences as well, because I'm of the same mind: a measure of control over your partitions is usually preferable to letting a machine figure it out dynamically all the time.

(Also, it's cheaper, since we don't have to pay to constantly optimize our data.)

While your test is arbitrary and not necessarily a real-world use case or performance comparison, if the scale is indicative of anything, I could argue that nobody queries "the whole table" like this and that, in fact, they query very specific time ranges of interest (i.e. the last 7/14/30 days). If real-world query volumes are 10% or less of the total volume, a linear performance profile would put both the manually partitioned and the liquid clustered tables in negligible ranges (i.e. 8 seconds * 10% = 0.8 seconds and 4 seconds * 10% = 0.4 seconds).
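To make that concrete, the shape of query I have in mind is something like this (table and column names are made up; any layout that prunes on the date column gets the linear win):

```python
# Hypothetical sketch: the narrow time-range query that dominates real
# workloads. With partitioning or clustering on event_date, only files
# covering the last 30 days get scanned, so cost scales with window size.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

recent = (
    spark.read.table("events")  # assumed Delta table
    .filter(F.col("event_date") >= F.date_sub(F.current_date(), 30))
)
recent.groupBy("event_date").count().show()
```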

One kind of value you get from data being organized in your own partition scheme is that you can more cheaply and effectively manage archival and migration tasks, with cheaper tools/technologies too (Azure Functions vs. Synapse Serverless SQL vs. Presto vs. Databricks)!
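For instance, archival under an explicit date-partition scheme can be as simple as this sketch (table name, path, and cutoff are all placeholders):

```python
# Hypothetical sketch: copy cold partitions to cheap storage, then delete
# them from the hot table. With the table partitioned by event_date, the
# delete prunes whole partitions instead of scanning everything.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
cutoff = "2023-01-01"

(spark.read.table("events")
    .filter(F.col("event_date") < cutoff)
    .write.mode("append").format("delta")
    .save("s3://archive-bucket/events/"))

spark.sql(f"DELETE FROM events WHERE event_date < '{cutoff}'")
```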

Daniel Beach:

I work in an ML-heavy environment, and we do massive daily table reads and computations on tens of billions of records spanning very large timeframes.

I expect liquid clustering is simply the next generation of partitioning that we will all be using in short order.

Fahim Kanji:

In your environment, do you archive data at all? I guess the impact on archiving will just land in your test criteria, to validate against some of the concerns you've already pointed out ("what if a one-off query changes the partitioning, thereby creating negative consequences?").

Side questions:

1: Since your environment is ML-heavy, its need for past state is likely largely fulfilled by Delta versioning. How long do you keep Delta versions for? (The retention knobs I mean are sketched after question 2.)

2: Do you have to deal with the need for fully auditable state change or longitudinal analysis (i.e. like SCD Type 2)? How do you provide this? (If needed, I assume it is probably a different dataset.)
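For question 1, the knobs I mean are the standard Delta table properties; a sketch with illustrative values (not recommendations) and an assumed table name:

```python
# How long versions survive is governed by Delta table properties, and
# time travel only works until VACUUM removes the underlying data files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 30 days'
    )
""")
spark.sql("VACUUM events RETAIN 720 HOURS")
```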

Sarthak Nagpal:

I might have missed this in the article, but why are we optimizing the liquid clustered table after every run?

Daniel Beach:

I simply did it because, as far as I can decipher, statistics are gathered on columns, OPTIMIZE is non-deterministic, and I can only assume you have to run OPTIMIZE at least daily for the "liquid" in liquid clustering to take place.
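Roughly, the pattern I mean looks like this (Databricks-specific syntax; table and column names are placeholders):

```python
# Liquid clustering is declared with CLUSTER BY instead of PARTITIONED BY,
# but the actual (re)clustering of data files happens when OPTIMIZE runs --
# hence the daily job. Note there's no ZORDER BY with liquid clustering;
# plain OPTIMIZE is the whole command.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        equipment_id BIGINT,
        event_ts TIMESTAMP,
        value DOUBLE
    ) CLUSTER BY (equipment_id)
""")
spark.sql("OPTIMIZE events")
```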

Josiah Johnston:

I really appreciate your posts and deep dives on these topics.

I have had similar experiences to Fahim's. I've also seen DBX do unexpected caching tricks that have skewed my benchmarks.
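When benchmarking I've had to guard against that; the knobs involved are roughly these (the config name is the usual Databricks disk-cache flag; whether it explains every skewed number is another question):

```python
# Disable the Databricks disk (IO) cache for the session and drop anything
# already cached by earlier runs, so repeat queries hit storage again.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.databricks.io.cache.enabled", "false")
spark.catalog.clearCache()
```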

In my experience with big IoT datasets, the major bottlenecks to read speed are generally:

1. Loading indexes for predicate push-down

2. Scanning & filtering data

99% of queries on my datasets are by equipment id(s) and time range. The users responsible for the remaining 1% of queries (random aggregations, etc.) usually have a low bar and are happy if they can get anything working, and don't sweat whether it takes seconds or minutes, or even hours for big bespoke projects.

In these cases, I classically got the best performance for small queries by partitioning by equipment and minimizing file count per equipment. For big high-throughput queries that scanned most of the dataset, I got better performance out of Spark by lumping several equipment into larger files (ex: 128 MB). Bigger files optimized for high throughput and full scans could be 5x faster than smaller partitions, but small queries against them had way lower performance. Conversely, big scans perform worse with smaller files (many open handles?), but small focused queries do way better because very little data is scanned from disk.
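A rough sketch of the two layouts I'm comparing (paths, names, and the 128 MB figure come from the tradeoff above, not a universal rule):

```python
# Layout A: one small, sorted slice per equipment -- fast focused lookups.
# Layout B: lump equipment into larger ~128 MB files -- better full scans.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("iot_readings")  # assumed source table

# Layout A:
(df.repartition("equipment_id")
   .sortWithinPartitions("event_ts")
   .write.mode("overwrite")
   .partitionBy("equipment_id")
   .format("delta")
   .save("s3://lake/iot_by_equipment/"))

# Layout B, expressed as a Delta target file size for compaction:
spark.sql("""
    ALTER TABLE iot_readings
    SET TBLPROPERTIES ('delta.targetFileSize' = '128mb')
""")
```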

Spark/DBX's lack of effective use of Parquet row-group indexing has been a point of frustration for me, because it allows a finer grain of indexing, which means smaller data scans and a lower memory footprint. But I've come to accept that they put a low priority on fine-grained queries and emphasize high-throughput use cases, so I'm forced to use different tools for different tasks.

Also, I think they assume most people have a low understanding of data layout and that their automation will be better than most hand-tuning. This would be fine if they gave an easy button plus hooks for visibility and optimization, but they are trying more and more to make their UC managed tables opaque, with few hooks for hands-on control. Frankly, it's disappointing, because their solutions are often 5-10x worse-performing than what I know is possible with more hands-on control. My datasets and use cases may not be their average customer's, but IoT timeseries represents a huge and growing chunk of the market.
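The kind of row-group-level pruning I end up reaching for other tools to get, going straight at the underlying Parquet with PyArrow (file path and column name are placeholders):

```python
# Row-group statistics let a narrow filter skip most of the file without
# reading it, keeping both scan volume and memory footprint small.
import pyarrow.parquet as pq

table = pq.read_table(
    "data/equipment_readings.parquet",
    filters=[("equipment_id", "=", 42)],
)
print(table.num_rows)
```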

FWIW, in recent months I've seen ridiculously bad performance and data layout with Liquid Clustering by equipment id & timestamp. I've gotten decent results with LC by equipment id alone, with a smaller target file size and frequent OPTIMIZE runs. I still wish they wouldn't always randomize row order (it's counterproductive for most downstream timeseries operations), but that's more a compute cost for sorting than a data-read cost.
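Concretely, the setup that has worked decently for me looks roughly like this (Databricks-specific syntax; names and the file size are illustrative):

```python
# Liquid clustering on equipment_id only, a smaller target file size,
# and OPTIMIZE on a schedule so clustering keeps up with new data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("ALTER TABLE iot_readings CLUSTER BY (equipment_id)")
spark.sql("""
    ALTER TABLE iot_readings
    SET TBLPROPERTIES ('delta.targetFileSize' = '32mb')
""")
spark.sql("OPTIMIZE iot_readings")
```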
