The infinite loop. Nothing screams the beginning of a programmer’s journey more than a loop that never ends. Loops. I still remember writing some of my first loops in PHP and Perl. Wait. Did I say that out loud? I probably just dated myself to a specific decade.
There are probably few things more common to generic Data Engineering tasks than ye olde for loop. We’ve all looped a CSV a few times in our lives, have we not?
Funny, leave it to a bunch of programmers with decades on their hands and nothing better to do, and voilà, now we have a million ways to loop.
map
filter
reduce
while
for
something else?
Ultimately, we have an iterable, and then we must do something with that iterable. That’s life.
Today I want to talk about iteration. Simply for fun. Python. Rust. Maybe you will try something new. That’s the goal.
Programming Style, or more?
Who decides what to use, a for loop or a map?
Is it just the way we were born? What language we first used? Who taught us? Blind luck? Our loosely held opinion on clean and readable code? What we think is the fastest? Probably all of the above.
I still remember when list comprehension became a thing in Python. Boy, did I go overboard. I could write list comprehensions 3 layers deep with a lambda thrown in just for fun. Those were the days.
I’m personally not a big stickler when it comes to how people write code. As long as it’s legible, has been run through something like ruff or black (for Python), and functions don’t contain more than 20 lines of code max … I won’t complain.
Just for fun.
Let’s solve a problem, and try a few different ways of iterating. Simple problem. Let’s convert a CSV file to a tab-delimited flat file. Iterate the rows, but do it a few different ways, in both Python and Rust.
Let’s simply get a feel for what our options are and how they fit … like a pair of shoes, sometimes you just know if you’re going to like something or not.
We are going to use a single file from the Divvy Bike Trips open data set. Pretty typical, looks something like this. The file I have has about 770K records.
Let’s say we work in some place old and crusty. They want to move everything backward in time; they enjoy the old stuff. As such, we get CSV files on a daily basis that they want converted to tab-delimited files so their old mainframe can consume them.
So be it.
With Python I say.
First, we are going to do the task with Python. Let’s be creative and find a few different ways. Let’s time the results as well, just for fun.
I mean it’s Python. There are probably 50 ways to solve this problem with it. Writing Python is like the twilight zone, anything and everything is possible, all at the same time.
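The timings quoted in this post print like Python timedelta values, so a timing harness can be as small as the sketch below. The helper name timed is my own placeholder, not something from a library, and this is only one plausible way to take the measurements.

```python
from datetime import datetime


def timed(func, *args):
    """Run func(*args) and return (result, elapsed) where elapsed is a timedelta."""
    start = datetime.now()
    result = func(*args)
    elapsed = datetime.now() - start
    return result, elapsed
```

Printing the elapsed value gives the familiar hours:minutes:seconds.microseconds format.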
Let’s start with a simple for loop.
It works perfectly fine and converts the file, although at quite a snail’s pace: 0:00:03.925561.
This is probably the code we are all used to seeing. Simple for loop, very easy to read and reason about.
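For reference, here is a minimal sketch of that kind of loop, built on the standard csv module. The function name and whatever paths you pass in are placeholders, and this is not necessarily the exact code behind the timing above.

```python
import csv


def csv_to_tab_for(src: str, dst: str) -> None:
    """Copy every row of a CSV file into a tab-delimited file with a plain for loop."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        # lineterminator="\n" is a choice made for this sketch; csv defaults to "\r\n".
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        for row in reader:
            writer.writerow(row)
```

Something like csv_to_tab_for("trips.csv", "trips.tsv") would do the conversion, with both file names being hypothetical.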
Let’s try to do this with map.
Well, we can say this is less legible. It probably takes a second glance for someone seeing it for the first time to figure out what is happening.
It’s also notably slower: 0:00:04.148699.
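A map version might look something like the sketch below. It assumes no field contains a tab or a newline, since the rows are joined by hand rather than passed through a csv writer, and the function name is again just a placeholder.

```python
import csv


def csv_to_tab_map(src: str, dst: str) -> None:
    """Convert CSV to tab-delimited text by mapping each parsed row to an output line."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        # map is lazy: each row is joined only as writelines pulls it.
        lines = map(lambda row: "\t".join(row) + "\n", reader)
        fout.writelines(lines)
```

The lambda is called once per row, which is one likely reason a version like this does not beat the plain loop.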
Just because we are gluttons for punishment, and Python is the language of the masses, let’s do this with a filter. Strange, eh?
We are on a roll, getting slower every time: 0:00:05.695858. That is expected, since our filter function has to be called on every row.
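One way to bend filter into doing this job, and I am guessing at the shape of the trick here, is to give it a predicate that writes the row as a side effect and then rejects it. The names below are hypothetical, and this is a sketch of the abuse rather than a recommendation.

```python
import csv


def csv_to_tab_filter(src: str, dst: str) -> None:
    """Convert CSV to tab-delimited text by (ab)using filter for its side effects."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")

        def write_and_drop(row):
            writer.writerow(row)  # side effect: emit the row
            return False          # reject every row, so filter keeps nothing

        # filter is lazy, so it has to be consumed for the writes to happen.
        list(filter(write_and_drop, csv.reader(fin)))
```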
As if we aren’t twisting the laws of Pythonic Python enough, let’s do this with a reduce function.
Reduce is faster than the filter, at 0:00:04.002285.
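A reduce version could be sketched along these lines, with functools.reduce folding over the rows and carrying a running row count as the accumulator. The function name and the decision to return the count are my own assumptions for illustration.

```python
import csv
from functools import reduce


def csv_to_tab_reduce(src: str, dst: str) -> int:
    """Convert CSV to tab-delimited text with reduce; returns the number of rows written."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")

        def step(count, row):
            writer.writerow(row)  # side effect: emit the row
            return count + 1      # accumulator: rows written so far

        return reduce(step, csv.reader(fin), 0)
```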
And if you thought we were done, think again. We have the while loop.
And who would have thought, the slowest one yet: 0:00:06.743144.
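A while loop over a reader has to pull rows by hand. A minimal sketch, assuming a None sentinel from next() to detect the end of the file and a placeholder function name:

```python
import csv


def csv_to_tab_while(src: str, dst: str) -> None:
    """Convert CSV to tab-delimited text, pulling rows manually in a while loop."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        row = next(reader, None)  # None signals that the reader is exhausted
        while row is not None:
            writer.writerow(row)
            row = next(reader, None)
```

The explicit next() call and the sentinel check on every row are extra work that the for loop does for us behind the scenes.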
I do have to say, the filter, reduce, and map versions are probably the most confusing to read and understand. In the context of a larger codebase, it would take some mental overhead when you run across code like that.
This is sort of sad for all of us who like to act smarter than we are. Guess we are stuck with the old for loop.
I suppose this simply means that the trusty, bare-bones for loop is good for the fast stuff. The fancier tools are fun, but they incur a penalty, especially when applied incorrectly.
Doing stuff in Rust.
Let’s do some of the same loops in Rust, and see what happens. We shall again start with a simple for loop. This looks very similar to the Python.
Gandalf’s Beard! That Rust is fast. Duration { secs: 0, nanos: 787616000 }, or about 0.79 seconds. Well under a second.
Makes you wonder why we don’t use Rust for more Data Engineering tasks. Easy to read, even for those who aren’t familiar with Rust, hard to miss a for loop.
While we are at it, let’s try the map function in Rust.
I don’t mind the map function here; it doesn’t ruin the readability that much. It’s still apparent what is going on. A little slower than the for loop, but still blazingly fast compared to Python.
Duration { secs: 0, nanos: 851033000 }
Thoughts on complexity and loops.
For some reason I think, as humans and engineers, we can fall into the trap of falling in love with complexity. Especially early on in our careers, we can equate complexity with genius. This is far from the truth.
It’s usually the opposite. And we find this truth played out in our examples of both Python and Rust.
We all loop. All the time.
Don’t you feel dirty simply writing a simple “for” loop sometimes? Like you should be doing something more fancy or complex?
Let this be a lesson for us all. To borrow loosely from Led Zeppelin, not everything that glitters is gold. There ain’t nothing wrong with a good ol’ for loop; don’t let anyone tell you otherwise.