We know all good developers love a battle. A fight to the death. We love our file formats. We love our languages, and even when we already know what the answer will be, it's still fun to have the fight.
While this topic of JSON → Parquet may seem like a strange one, it was actually one of the first projects I worked on as a newly minted Data Engineer: converting millions of JSON files stored in S3 into Parquet files for analytics. So maybe it isn't so far-fetched after all.
When dealing with massive datasets, the storage format can significantly impact performance. While JSON is an excellent medium for data interchange, Parquet, a columnar storage format, is more suitable for big data analytics due to its efficiency and speed.
You should check out Prefect, the sponsor of the newsletter this week! Prefect is a workflow orchestration tool that gives you observability across all of your data pipelines. Deploy your Python code in minutes with Prefect Cloud.
In this article, we'll compare converting JSON to Parquet using Python and Rust, highlighting their performance and idiosyncrasies.
Every Pythonista secretly wishes they were writing Rust instead, right? Think of all that performance, the strong typing, braces around blocks, and, most importantly, the street cred. Except… is it really true that we'll see performance gains if we make the switch?
Python: Using PyArrow and Pandas (and maybe Polars)
First, we are going to start with Python. Python is one of the most popular languages for data manipulation and has a vast ecosystem of libraries.
The next best thing to reach for in the Python data world is PyArrow, a cross-language development platform for in-memory analytics that provides an easy way to convert JSON to Parquet.
Apache Arrow powers other popular libraries like Polars.
For our sample dataset, we will be using a credit card fraud table that can be found on Kaggle.
We will be reading this JSON file and converting it to a Parquet file.
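With Pandas and PyArrow, the whole conversion fits in a few lines. Here is a minimal sketch of that approach (the file name fraud.json is a stand-in for the Kaggle dataset):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the JSON file (a single array of objects) into a DataFrame.
df = pd.read_json("fraud.json")

# Convert the DataFrame to an Arrow Table, then write it out as Parquet.
table = pa.Table.from_pandas(df)
pq.write_table(table, "fraud.parquet")
```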
One thing to note is the utter simplicity of this Python code. It runs a little slow at over 4 seconds.
This is probably cheating, but what if we just used Polars (the Python bindings for the Rust library) to do the same thing? Will it be simpler and faster than Pandas + PyArrow?
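Here is the Polars version of the same conversion (again a minimal sketch, using the same stand-in file name):

```python
import polars as pl

# Polars reads the JSON array and writes Parquet in two calls.
df = pl.read_json("fraud.json")
df.write_parquet("fraud.parquet")
```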
Indeed it is simpler and much faster, almost down to 3 seconds.
On a side note, this shows the power of Polars and how it might change the Data Engineering space. A task that once required Pandas and PyArrow can now be done quickly and more simply with Polars.
Advantages of Python:
Direct integration with Pandas and Polars, the primary tooling for data manipulation in Python.
A mature ecosystem with vast documentation.
Very terse and readable code.
Drawbacks of Python:
Might be slower than low-level languages like Rust on vast datasets.
This is just yet another reason to switch from Pandas to Polars. No excuses.
Rust: Using Arrow and Parquet Crates
This should be interesting. We already saw that we could move from Pandas in Python to Polars (Rust under the hood) and get a nice little speed-up with some very simple code.
What will Rust look like? No doubt the code will be more verbose and complex, but will the performance make it worth it?
Rust, known for its memory safety guarantees and blazing-fast speed, has libraries to handle Parquet and Arrow operations, making it a strong contender for large-scale data processing tasks.
It was discovered during this experiment that the JSON reader for Rust’s Arrow implementation expects a text document where each line is a row of JSON, rather than a single JSON array of objects.
From the crate's documentation:
> This JSON reader allows JSON line-delimited files to be read into the Arrow memory model. Records are loaded in batches and are then converted from row-based data to columnar data.
Going into this blind, that discovery slowed me down, and it should factor into the total effort involved. It could probably be circumvented by parsing the JSON document with some other mechanism, though perhaps with a significant performance impact depending on how that mechanism works.
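For reference, the Rust version comes out looking roughly like this. Treat it as a sketch against recent versions of the arrow and parquet crates (the JSON reader API has shifted between releases), with fraud.ndjson standing in for the preprocessed newline-delimited file:

```rust
use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;

use arrow::json::reader::infer_json_schema_from_seekable;
use arrow::json::ReaderBuilder;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the newline-delimited JSON file (one JSON object per line).
    let mut input = BufReader::new(File::open("fraud.ndjson")?);

    // Sample the file to infer an Arrow schema; the reader seeks back afterwards.
    let (schema, _) = infer_json_schema_from_seekable(&mut input, None)?;
    let schema = Arc::new(schema);

    // Stream the file into Arrow RecordBatches.
    let reader = ReaderBuilder::new(schema.clone())
        .with_batch_size(1024)
        .build(input)?;

    // Write each batch out to a Parquet file.
    let output = File::create("fraud.parquet")?;
    let mut writer = ArrowWriter::try_new(output, schema, None)?;
    for batch in reader {
        writer.write(&batch?)?;
    }
    writer.close()?;

    Ok(())
}
```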
It’s a little daunting to see the complexity of the Rust required to do the same job Python did.
It was also slow: roughly as slow as the Pandas version, and slower than the Polars Python example. Part of the reason for the complex Rust was the inability of the Arrow crate's JSON reader to ingest an array of JSON objects with no newline characters.
So the original JSON file had to be preprocessed into a newline-delimited format the reader could handle; a quick preprocessing script is sketched below.
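That one-off preprocessing step is a few lines of Python (a sketch assuming the source file holds a single JSON array of objects):

```python
import json

# Convert a single JSON array of objects into newline-delimited JSON.
with open("fraud.json") as src, open("fraud.ndjson", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")
```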
Advantages of Rust:
Typically faster, since Rust compiles to native code with no interpreter overhead.
Memory safety, ensuring that there aren't unwanted surprises during execution.
Drawbacks of Rust:
Steeper learning curve if you're new to Rust.
The ecosystem is less mature compared to Python's, but it's growing fast.
The Arrow crate's JSON reader is strict, expecting newline-delimited files rather than a conventional JSON array.
The Truth about Rust vs Python for JSON → Parquet
Let's be honest. It probably wasn't what I, or anyone else, expected. Are there some Rust savants who could eventually write code that is faster and better? Of course.
But, that’s not the point. We learned something important.
If you can write 4 or 5 lines of Python that run faster than the Rust you spent a TON of time writing … well, sometimes it's about more than just the obvious.
Having less code to manage and debug, in Python’s case, can make it a clear choice for some. Especially if we can just use Rust via something like Polars in Python.
It's important to know when to use which tool for the job, that's for sure. Of course, there is only one way to learn this stuff: by trying it out!
The real lesson is to carefully consider your storage format in the first place.