You know, I’ve done my fair share of DSA (Data Structures and Algorithms) nonsense over the years. In fact, you can read about some DSA on this very Substack …
DSA For The Rest Of Us (Part 1) - Linked Lists
DSA For The Rest Of Us (Part 2) - Binary Search
DSA For The Rest Of Us (Part 3) - Quick Sort
I honestly am ho-hum about the whole thing. If you are looking to learn mostly useless things and feel better about yourself, go ahead and spend your time in fantasy DSA land.
But, I have a different idea.
How about you learn something that you will actually have to do a million times over in your career … namely String Manipulation? I mean, what could be more boring, huh?
Plain old String Manipulation.
The problem is that anyone in Data Engineering, or the Data Space for that matter, spends more time than they would like to admit manipulating Strings in all manner of ways. You probably do it so much that you don’t even think about it anymore.
Strings, Strings, Strings.
If someone had told me before I started all this, back when I was playing around with Perl, that lo these many years and decades later, after an illustrious career designing Data Platforms and Machine Learning platforms … I would still find myself munging around Strings to this very day … well, I don’t know.
What’s a person to do? The best you can.
So, let’s just have some fun today. Let’s do some typical String manipulation in both Python and Rust (just because we can), and hopefully help those Young Bloods get a taste of what their future looks like.
Sample Problems to Solve.
Let’s start with some of the most common problems of all with Strings in a Data Engineering context, and just see where it leads us.
One place we’ve all been before is `s3` URIs, basically cloud storage file locations. Oh, Lord knows how many times we’ve manipulated cloud storage paths.
You would think we would have written general functions to do this stuff for us by now, but I guess we are gluttons for punishment.
Let’s just solve some problems and see WHAT it is exactly we have to do with Strings afterward.
Separating Strings … aka Splitting them.
Finding the last occurrence of a String in a String
Finding the first occurrence of a String in a String
Building Strings from other Strings
I mean that’s what it all boils down to, isn’t it?
How can we split the s3 bucket apart from the s3 key?
Easy stuff in Python, huh?
The main feature is splitting a String. The concept of splitting a String is extremely common in most languages. In the case of Python, we get a List/Array back and we can take the first result in the List.
Basically, we are calling `split('/')` to split the string on every / … give me the stuff in between.
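Here is a minimal sketch of that split-and-take idea, not necessarily the exact code from the original post. The URI and the function name `split_s3_uri` are made up for illustration.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3 URI into (bucket, key). Hypothetical helper for illustration."""
    # Replace the scheme with nothing, then split what is left on "/".
    parts = uri.replace("s3://", "").split("/")
    bucket = parts[0]            # the first element of the List is the bucket
    key = "/".join(parts[1:])    # everything after it, glued back together, is the key
    return bucket, key


# Made-up example URI.
bucket, key = split_s3_uri("s3://my-bucket/raw/2023/01/15/data.csv")
print(bucket)  # my-bucket
print(key)     # raw/2023/01/15/data.csv
```

If all you need is the bucket, `parts[0]` is the whole answer … the join is only there to put the key back together.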
Split, split, split all day long, split, split, split while I sing my song. What, your Grandma never taught you that one??
What about Rust? I’m going to do this in a very verbose manner … no syntactic sugar.
I mean the cool kids (of which I was never one) in Rust would probably write …
Again, this splitting a String concept is pretty much identical in Rust and Python. We can split on some specific char, and then grab either the front or back from the resulting Array/List.
Notice in all the answers, one Python and two Rust, we also did a replace of some part of the string with nothing (an empty String). This replace feature is something that is used consistently in Data Engineering when manipulating Strings.
Finding the last occurrence of a String in a String
What else can we do with Strings that happens often? How about the classic finding the last occurrence of a String in a String? The practical application would be getting the unknown filename out of an s3 URI.
In this case, all we know is we need to find the last /, and everything else afterward is the String … filename we are after.
(There are actually shortcuts to this answer using split … but we will pretend that doesn’t exist for now)
Here’s some Python.
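A minimal sketch of the find-the-last-slash approach, assuming a made-up URI. The function name `filename_from_uri` is hypothetical.

```python
def filename_from_uri(uri: str) -> str:
    """Return everything after the last "/" in the URI."""
    # rfind gives the index of the LAST "/" in the string, or -1 if there is none.
    last_slash = uri.rfind("/")
    # Slice from the index right after that slash to the end of the string.
    return uri[last_slash + 1:]


# Made-up example URI.
print(filename_from_uri("s3://my-bucket/raw/2023/01/15/data.csv"))  # data.csv
```

When there is no / at all, `rfind` returns -1, so the slice starts at index 0 and you simply get the whole string back.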
Here’s some Rust.
Again, these are extremely similar answers. Both provide an rfind() method that is very helpful indeed. Also, once the location or index of the String within the String is found … we use that to “slice” into the original string at a specific location and “take” the rest.
Things of Note
I think it’s important to stop here, for those newbies, who maybe haven’t worked a ton with Strings and talk about Strings as Arrays or Lists.
This ain’t no Computer Science class, so take your comments elsewhere, but you can think of a String as a List or Array of index locations, each location holding one char, which together make up the entire String.
This is what the solutions above are doing: we find the index of the last /, and then slice into the array from the FOLLOWING index through the end of the array.
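To make the Strings-as-Arrays idea concrete, here is a tiny sketch using a made-up path. The variable names are just for illustration.

```python
path = "a/b/c.txt"  # made-up example string

# Each index location holds one char.
print(path[0])         # a
print(path[1])         # /
print(len(path))       # 9
print(list(path)[:3])  # ['a', '/', 'b']

# Find the index of the last "/", then take everything from the FOLLOWING index on.
last_slash = path.rfind("/")
print(last_slash)      # 3
rest = path[last_slash + 1:]
print(rest)            # c.txt
```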
Building Strings
Probably the most mundane topic for today, and the easiest part, building strings. Those who’ve been around the block have done this a million and one times … but let us not forget our younger counterparts.
When we are splitting strings, finding occurrences, or pulling apart strings … usually the next thing we do is put them back together again, like Humpty Dumpty.
Let’s say we want to pull the date parts out of an s3 URI because that’s our only option, let’s do that and put the date parts back together again.
This answer sort of combines some of the topics we’ve already talked about … mostly we reuse the idea that Strings are just Lists/Arrays of chars at different indexes.
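Something along these lines, as a rough sketch. It assumes the URI always ends in `/YYYY/MM/DD/filename` (my assumption, your paths may look different), and the function name `date_from_uri` is hypothetical.

```python
def date_from_uri(uri: str) -> str:
    """Pull the date parts out of the URI and rebuild them as YYYY-MM-DD.

    Assumes the URI ends in /YYYY/MM/DD/filename.
    """
    parts = uri.split("/")
    # Index from the end of the List: the last element is the filename,
    # and the three elements before it are the year, month, and day.
    year, month, day = parts[-4], parts[-3], parts[-2]
    # Put Humpty Dumpty back together again.
    return f"{year}-{month}-{day}"


# Made-up example URI.
print(date_from_uri("s3://my-bucket/raw/2023/01/15/data.csv"))  # 2023-01-15
```

An f-string is one way to build the new String … `"-".join([year, month, day])` would do the same job.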
And the same thing in Rust.
Amazing how similar the two solutions are eh?
Thoughts on String Manipulation
Nothing much has changed in all these years of Data Engineering, through all the comings and goings of technologies, companies, whatever. And here I am, still piddling with Strings these many moons later.
I think in the beginning, when we first start working with Strings, things are unfamiliar. I remember those days, you do the best you can.
We forget that we can even interact with Strings as Arrays/Lists. We forget about all the nice built-in functions … split … rfind … etc.
Let me know what you do with Strings!!