Sometimes, we should talk about the boring, even though we don’t want to. It’s part curiosity, part ritual to see if anything has changed much. We all get stuck in our old ways; learning new tricks is hard.
Part of this was spurred by some recent chatter online about PySpark/Spark and how people SHOULD be adding new columns, per all the internet savants.
Honestly, I’m not sure if I care that much anymore. At a certain point, I outgrew the need to optimize every single piece of code … unless that code was causing problems and NEEDS to be optimized.
When you are young and full of spit and vinegar, as you well should be, you want to take on the coding world. Every single piece of code is agonized over, optimized, and squeezed for every drop of bits and bytes until it bleeds.
Thanks to Delta for sponsoring this newsletter! I personally use Delta Lake on a daily basis, and I believe this technology represents the future of Data Engineering. Content like this would not be possible without their support. Check out their website below.
It’s about priority, working on what the business needs, providing value, and not doing whatever whim strikes you on that day because you had too much coffee.
Did you like my rant? Sorry, not sorry.
Ok. Columns.
I mean, if you grew up listening to Limp Bizkit like me, then there is a good chance you are old, wizened, and raw dogging Spark pipelines using .withColumn(), like tomorrow is your last day on earth.
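For anyone who hasn’t seen it, here’s a rough sketch of that pattern (the DataFrame and column logic are made up purely for illustration):

```python
# A made-up orders DataFrame, just to show the pattern.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("hamburger", 2, 3.50), ("hotdog", 5, 1.25)],
    ["item", "quantity", "price"],
)

# One .withColumn() per new column, chained until your fingers get tired.
df = (
    df.withColumn("total", F.col("quantity") * F.col("price"))
      .withColumn("is_bulk", F.col("quantity") > 3)
)
```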
This is the most classic use of adding columns. It is popular because it allows you to easily add complex business logic without a lot of fluff. Of course, this is what all the smart engineers will tell you is worthy of getting your knuckles cracked by your grandma.
After getting a good rap on the hands, you quickly move on to using .withColumns() …
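Something along these lines, assuming Spark 3.3+ where .withColumns() showed up (same hypothetical columns as before):

```python
# Every new column goes into one dict, one call.
df = df.withColumns(
    {
        "total": F.col("quantity") * F.col("price"),
        "is_bulk": F.col("quantity") > 3,
    }
)
```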
Even though it looks kinda messy, we do what we have to. Then, if you are feeling spicy and just want to mess with someone … you whip out a little .selectExpr() …
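Roughly like this (again, hypothetical columns):

```python
# The new columns are plain SQL expressions, tacked onto a star select.
df = df.selectExpr(
    "*",
    "quantity * price AS total",
    "quantity > 3 AS is_bulk",
)
```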
I mean, aesthetically, this is probably the most pleasant option for people, although it is the least used because it looks more like SQL!
Every once in a great while, you might see something strange.
Let’s be honest; most hobbits do spark.sql("") and are never seen again.
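If you’ve somehow never seen that move, it looks something like this (the view name is my own invention):

```python
# Register the DataFrame as a temp view, then it's just SQL all the way down.
df.createOrReplaceTempView("orders")

df = spark.sql("""
    SELECT
        *,
        quantity * price AS total,
        quantity > 3 AS is_bulk
    FROM orders
""")
```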
Polars and columns.
The new kid on the block, Polars, simply follows the Spark patterns that came before it: .with_columns() combined with expressions.
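A minimal sketch of what that looks like (same imaginary data as the Spark examples):

```python
import polars as pl

df = pl.DataFrame(
    {"item": ["hamburger", "hotdog"], "quantity": [2, 5], "price": [3.50, 1.25]}
)

# .with_columns() takes expressions, each aliased to the new column name.
df = df.with_columns(
    (pl.col("quantity") * pl.col("price")).alias("total"),
    (pl.col("quantity") > 3).alias("is_bulk"),
)
```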
My guess is that most people don’t love this style; expressions seem to be something folks avoid. They probably like the select approach better.
You can do the classic select + expression in Polars as well.
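Something like this, where pl.all() keeps the existing columns in play:

```python
# select() with pl.all() is the Polars cousin of selectExpr("*", ...).
df = df.select(
    pl.all(),
    (pl.col("quantity") * pl.col("price")).alias("total"),
    (pl.col("quantity") > 3).alias("is_bulk"),
)
```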
Even Polars has SQL at this point.
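One way in is through SQLContext (the registered name “orders” is made up for the example):

```python
# Register the frame under a name, then query it with plain SQL.
ctx = pl.SQLContext(orders=df)

df = ctx.execute(
    """
    SELECT
        *,
        quantity * price AS total,
        quantity > 3 AS is_bulk
    FROM orders
    """,
    eager=True,
)
```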
I think what it comes down to is whether you are a DataFrame person or a SQL person. What does most of the codebase consist of?
If we ask ChatGPT to summarize it all for us, it pretty much tells us the same thing we’ve looked at, and knowing its training data (everyone’s GitHub repos), you can get a good idea of what is popular.
Sure, the talking heads want us to always use .withColumns() and the like, but honestly, who cares? If you’re literally adding 200 columns to a dataset, then you’ve got other problems besides how you decide to add them.
When it comes to choosing, I lean on readability. Some sort of selectExpr() is my default; it’s a good mix between the verbose DataFrame option and classic SQL-readable code.