Data Engineering Central

Partitions in Distributed Compute.

finding out the unknown

Daniel Beach
Mar 21, 2025

The topic of partitions, both in memory and on disk, seems to get little attention in our distributed computing world these days. With the rise of vibe coding and the never-ending quest to abstract things away, it will only get worse.

Yet, no matter what “they” do, understanding the fundamentals of distributed computing and how they relate to our datasets will separate the good Data Engineers from the bad and the ugly.

Staff Engineer, Senior Engineer, Junior Engineer, whoever you may be: let me ask you a few questions.

  • What size of parquet files will your Spark jobs write into your Lake House?

  • What is the optimal parquet file size in your Lake House?

  • How do raw file types and sizes read by distributed compute look as partitions in memory?

  • How does the in-memory layout of your distributed datasets affect the number and size of the files written out at the end of the job?
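To make the third question concrete, here is a rough back-of-the-envelope sketch in plain Python of how Spark's file sources appear to plan input partitions from raw files. It is an approximation under stated assumptions, not Spark's actual code: the function names are made up for illustration, the files are assumed to be splittable (as Parquet is), and the two defaults mirror Spark's documented `spark.sql.files.maxPartitionBytes` (128 MiB) and `spark.sql.files.openCostInBytes` (4 MiB) settings. Real jobs can differ because of other settings, compression, and file-format details.

```python
MIB = 1024 * 1024


def estimate_split_bytes(file_sizes, default_parallelism,
                         max_partition_bytes=128 * MIB,
                         open_cost_bytes=4 * MIB):
    """Approximate the target split size used when planning a file scan.

    Assumption: each file is padded with a fixed "open cost", and the total
    is spread across the available parallelism, capped at max_partition_bytes.
    """
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


def estimate_partitions(file_sizes, split_bytes, open_cost_bytes=4 * MIB):
    """Approximate the number of in-memory partitions for a set of files.

    Large files are cut into chunks of at most split_bytes; the chunks are
    then packed greedily (largest first) into partitions.
    """
    splits = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            splits.append(min(split_bytes, size - offset))
            offset += split_bytes
    splits.sort(reverse=True)

    partitions = 0
    current_bytes = 0
    has_files = False
    for split in splits:
        # Close the current partition if this chunk would overflow it.
        if has_files and current_bytes + split > split_bytes:
            partitions += 1
            current_bytes = 0
            has_files = False
        current_bytes += split + open_cost_bytes
        has_files = True
    if has_files:
        partitions += 1
    return partitions


# Ten 1 GiB files on 8 cores: each file is cut into 128 MiB chunks.
big_files = [1024 * MIB] * 10
big_split = estimate_split_bytes(big_files, default_parallelism=8)
big_partitions = estimate_partitions(big_files, big_split)    # 80

# One thousand 1 MiB files on 8 cores: small files get packed together.
small_files = [1 * MIB] * 1000
small_split = estimate_split_bytes(small_files, default_parallelism=8)
small_partitions = estimate_partitions(small_files, small_split)  # 39
```

The two examples hold roughly 10 GiB and 1 GiB of data, yet the partition counts are driven as much by how the data is laid out in files as by its total volume. Without a repartition or coalesce, each in-memory partition typically ends up as at least one output file, which is why the questions above are tied together.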

The simple truth.
