The topic of partitions, both in memory and on disk, in our distributed computing world seems to get little attention these days. With the rise of Vibe coding and the never-ending quest to abstract things away, it will only get worse.
Yet, no matter what “they” do, understanding the fundamentals of distributed computing and how they relate to our datasets will separate the good, the bad, and the ugly Data Engineers.
Staff Engineer, Senior Engineer, Junior Engineer, whoever you may be: let me ask you a few questions.
What size of parquet files will your Spark jobs write into your Lake House?
What is the optimal parquet file size in your Lake House?
How do the types and sizes of the raw files read by your distributed compute translate into partitions in memory?
How does the in-memory layout of your distributed datasets affect the number and size of the files written at the end of the pipeline? (See the sketch below for a concrete illustration.)
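To make that last question concrete, here is a minimal PySpark sketch. The bucket and dataset paths are hypothetical, and it assumes Spark's defaults (such as `spark.sql.files.maxPartitionBytes` at 128 MB); the point is simply that the partition count in memory drives the file count on disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Reading raw files: Spark carves them into in-memory partitions based on
# spark.sql.files.maxPartitionBytes (128 MB by default), so many small files
# or a few huge ones can both produce an awkward partition layout.
df = spark.read.parquet("s3://my-bucket/raw/events/")  # hypothetical path
print(f"Partitions in memory after read: {df.rdd.getNumPartitions()}")

# Each non-empty in-memory partition becomes (at most) one output file per
# write, so 200 partitions here means up to 200 parquet files on disk.
df.repartition(200).write.mode("overwrite").parquet(
    "s3://my-bucket/lake/events/"  # hypothetical path
)

# Coalescing to fewer partitions before writing yields fewer, larger files,
# closer to the commonly cited ~128 MB to 1 GB sweet spot for parquet.
df.coalesce(20).write.mode("overwrite").parquet(
    "s3://my-bucket/lake/events_compacted/"  # hypothetical path
)
```

In other words, the answer to “how many files, and how big?” is decided before the write ever happens, by whatever the partition layout in memory looks like at that moment.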