Review of Mage.ai (data pipelines) for Data Engineers.
Can it be? Another pipeline tool? Why yes, yes it is.
I know you are all gasping and covering your mouths in astonishment and disbelief, hardly able to contain yourselves. Another data pipeline tool has arrived at your doorstep, promising to solve every known problem you have, and ones you don’t.
“Part of my job, my dear readers, is to poke at new tools for you, as I’m sure you have many more important things to do.”
On a more serious note, I think keeping on top of changes in the Data Engineering world is an important part of our job, for all of us. Part of our value comes from knowing the tooling landscape and making decisions about our Data Stack with a full understanding of the best options available.
Today we are going to review something new to me, and to you, mage.ai. To be clear, I have no relationship with mage.ai; I speak the truth as I see it.
Why mage.ai?
Overview and concepts.
Installation and usage.
Closing Thoughts.
Why mage.ai?
Well, to be honest, I’ve seen this name, mage, pop up a few times in my various feeds. Like most people, I simply roll my dreary eyes and move along to the next argument about Data Mesh. I thought to myself, just what we need, another Airflow replacement, as if Prefect, Dagster, and the rest weren’t enough.
Then something caught my eye, something different.
“Mage is an open-source data pipeline tool for transforming and integrating data.”
Well, I'll be, no mention of pipeline orchestration, the word that appears on the front page of both Dagster’s and Prefect’s websites, which I think was a mistake on their part. Airflow does orchestration just fine; that’s why it’s winning.
I’m glad to see someone thinking outside the box and talking about data transformations and pipelines, which Data Engineers do all day long.
Overview and Concepts of mage.ai
Ok, so one of the trickiest parts of any new tool is really getting to the bottom of it and figuring out what it actually is. I’m going to first list what I think mage.ai is, along with the features and functionality they list on their website and in their documentation.
We will poke at things later when installing and using it.
mage.ai appears to be a Python package (pip installable).
mage.ai appears to be a bunch of decorators on top of your normal Python code (a quick sketch follows this list).
mage.ai appears to focus on Notebooks for easy development.
mage.ai appears to reference that it scales easily (whatever that means).
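To make that decorator point concrete, here is a minimal sketch of what a Block looks like, based on my memory of mage’s generated templates, so treat the specifics (the function name, the CSV path) as placeholders rather than gospel:

```python
import pandas as pd

# Inside a real mage project this import comes from the installed mage_ai package.
from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(*args, **kwargs):
    # Plain Python underneath; the decorator is what registers this
    # function as a Block in the pipeline.
    return pd.read_csv('some_file.csv')
```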
There are also a plethora of other ideas and concepts that mage.ai lists themselves.
“Every block run produces a data product (e.g. dataset, unstructured data, etc.)”
“Data validation is written into each block and tested every time a block is run.”
“Operationalizing your data pipelines is easy with built-in observability, data quality monitoring, and lineage.”
“Each pipeline and data product can be versioned.”
“Backfilling data products is a core function and operation.”
“Transform very large datasets through a native integration with Spark.”
And that isn’t everything; it’s all a little overwhelming. Well, at least it makes one thing clear: mage.ai is truly trying to be a data pipeline tool, as that’s what all the features are about.
Mage.ai Concepts.
Maybe we can get a better understanding of what we are in for by trying to understand the main concepts of mage.ai, as they present them.
These concepts are surprisingly straightforward and will make sense to anyone used to writing data pipelines.
“They don’t in themselves create some grand new idea that is transformational to data engineering pipelines. So mage.ai must be focusing on the ease of execution and development with their tool.”
Maybe mage.ai is just trying to do pipelines better generally? Not sure yet. That said, a few concepts tucked in there did jump out at me as unique.
Blocks produce Data Products that can be versioned, partitioned, or backfilled.
Blocks, when run, trigger validations on the Data Product they produce.
Data Pipelines being broken up into Blocks is nothing new, but automatically capturing versioning, partitioning, and backfilling of the output, and running validation on the Data Product, is a new sort of implementation. Time will tell.
Installation and Usage.
I won’t really spend much time on installation, you can simply use pip to install it, or use Docker, which is a nice touch.
Side note, when pip installing mage-ai I kept getting errors about pandas
ERROR: Failed building wheel for pandas
Failed to build pandas
ERROR: Could not build wheels for pandas, which is required to install pyproject.toml-based projects
Although I could pip install Pandas successfully on its own in the same environment, mage refused to install. Maybe it’s just my M1 Mac, who knows, you can always use the Dockerfile I guess.
Notes on Usage.
So if mage.ai is a data pipeline tool, we should probably write a pipeline eh? One thing becomes clear right from the start that I’m not the biggest fan of. Notebooks. I’ve seen Notebooks too many times, and they are always abused, but I get it. Shortens the learning curve and all.
It does tell you something about who the target audience is though.
They are great from a development point of view, but inevitably they turn into something else, an excuse to be messy.
Mage.ai can be installed via pip (if it works for you).
Use the provided Dockerfiles.
Get ready to Notebook (or work your way around them like me).
If mage.ai is made for Data Engineers, they should focus on documentation and usage centered around the best practices they care so much about, for example, a good plugin for PyCharm or VSCode.
A Little Note Before We Start.
I want to make something clear, I don’t try to be unkind in my review of tools. I am simply an engineer, starting from zero, with no knowledge, and trying to ramp myself up onto a tool.
I comment about my experience along the way. Who doesn’t run into problems when trying out a new tool? The whole point is that I struggle so you don’t have to, and we get a better understanding of a new tool in the process.
It’s possible my struggles say more about me than a tool, but there is value there.
Example mage Pipeline.
I have 20GB worth of data from my previous post on DuckDB; we will write a mage pipeline to process this data. The data is the Backblaze hard drive failure data for Q1, Q2, and Q3.
Since mage supports Spark, we will attempt to write a PySpark pipeline with it, and see how things go. The first command I ran init’d a new project and started the Notebook.
mage start mage-test-spark
You will then end up with something like this in your browser. So it’s clear from the start mage.ai is at least partly a visual Pipeline development tool. That will become clear in the rest of the example walkthrough; there are no CLI commands to run, it’s the GUI, my friend.
Once you click to create a new Batch pipeline, you are met with this wizardry.
As you can see, there are a number of Blocks you can select from at the top. Remember, these Blocks map to sections of code that produce a Data Product.
I tried to select PySpark instead of Python at the very top and was met with a giant error. No surprise for a new tool. I figured maybe mage is having trouble finding my local Spark install.
I tried running the same command from the official mage Dockerfile as well and got a different error, it’s looking for boto3 credentials just because I want to write a PySpark pipeline locally.
So strange. Then I read the mentioned documentation, Lord save us all, it wants me to set up an EMR cluster! By George, I will do no such thing! Talk about expensive just to develop a PySpark pipeline locally! Lest you think I lie … “steps required to use Mage locally with Spark …”
They should at minimum have PySpark pip installed and Spark available inside the Docker image itself to help with local development. I should be able to select PySpark and write a Spark pipeline locally without needing credentials to a remote cluster.
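For context, here is roughly what I was hoping to write and run locally, with nothing more than a pip-installed PySpark and a local master. The paths are placeholders for wherever the Backblaze CSVs happen to live:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Plain local PySpark, no EMR cluster (or AWS credentials) required.
spark = (
    SparkSession.builder
    .master('local[*]')
    .appName('backblaze-local')
    .getOrCreate()
)

# Read all the quarterly CSVs and count failures per hard-drive model.
df = spark.read.csv('/home/src/data/*.csv', header=True, inferSchema=True)
failures = df.groupBy('model').agg(F.sum('failure').alias('failures'))
failures.show()
```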
Below is my understanding of the data Pipeline workflow in mage.ai so far.
Trying to move forward.
At this point, we still have yet to write a Pipeline with mage, but let’s just go with a simple Python pipeline. It was quite annoying, to be sure. I clicked Data Loader as my first Block in the UI and it generated a bunch of code to read a local file.
Not much use to me, someone who wants to write my own Python script to read multiple CSV files!
“Something is becoming clear to me, Mage is focused on being a cross between a GUI and pure code data transformation tool. It’s trying to do all the mundane parts of Data Pipelines for you, leaving you to fill in the Blocks (code), while it takes care of the rest.”
I finally switched to the “Scratchpad” to write what I hoped was my own code to load the 20GBs of CSV files.
I wrote two Blocks in what mage calls a Scratchpad, I will try executing them next. Nothing happened. Then I read the docs again, oops. “Use these blocks to experiment and write throw-away code. Scratchpad blocks aren’t used when executing a pipeline.” Dang it.
I then re-wrote my two Scratchpad blocks into the following, sketched just after this list …
Generic Data Loader Block.
Generic Transformer Block.
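Roughly, and from memory, those two Blocks looked something like the sketch below. In mage each Block lives in its own file; they are shown together here for brevity. The decorator imports mirror mage’s generated templates, the paths are placeholders for my local setup, and the column names (model, failure) come from the Backblaze schema:

```python
import glob
import polars as pl

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


# Block 1: the Data Loader.
@data_loader
def load_backblaze_csvs(*args, **kwargs):
    # Read every quarterly Backblaze CSV into a single Polars DataFrame.
    files = glob.glob('/home/src/data/*.csv')
    return pl.concat([pl.read_csv(f) for f in files])


# Block 2: the Transformer, which receives the loader's output.
@transformer
def failures_by_model(df, *args, **kwargs):
    # Count failures per hard-drive model.
    return df.group_by('model').agg(pl.col('failure').sum())
```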
I ran the Pipeline and got an error that Polars was not installed, of course, so I added it to a requirements.txt file and restarted the kernel. All that without knowing anything about the tool, I suppose that says something about its intuitiveness. Still no luck.
It appears mage keeps clearing out the requirements file every time I touch it or restart the kernel. Not really sure what to do. Do you really have to specify all your requirements before you start a project? That seems unlikely, but I’m stumped on how to add a pip-installable package while developing a pipeline.
I finally found the answer, by accident, I can run the pip install from a “Scratchpad.” Yikes.
The problem now is the Scratchpad defaults to Python it appears, and of course, errors when trying to run a pip install command. For the life of me, I can’t figure out how to switch it to a terminal.
Since I’m running this in Docker I simply run the suggested command to connect to the Container, and pip install it via the cmd line.
This time the Pipeline executes when told to do so in the UI, but I get another error. Something about a Circular reference? I know this code works; it runs fine outside of mage.
At this point, I just want the pipeline to run so I can see it with my own eyes. I try combining both Blocks into a single Block that should just execute. This time it worked.
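My unverified guess is that handing the Polars DataFrame from one Block to the next is what tripped up mage’s output handling; keeping everything in one Block sidesteps the handoff. The combined version looked roughly like this (paths again placeholders):

```python
import glob
import polars as pl

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_and_summarize(*args, **kwargs):
    # Read, aggregate, and persist in a single Block so nothing has to be
    # passed between Blocks.
    files = glob.glob('/home/src/data/*.csv')
    df = pl.concat([pl.read_csv(f) for f in files])
    summary = df.group_by('model').agg(pl.col('failure').sum())
    # Write the result out ourselves rather than relying on a downstream Block.
    summary.write_parquet('/home/src/data/failures_by_model.parquet')
    return summary
```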
I think I’m starting to get the feel of mage. You’re probably not going to like what I’m going to say next, but hear me out. I’ve had this sort of feeling before, I’ve felt this sort of development before … can I put my finger on it? Why yes I can. It’s called Microsoft’s SSIS.
Let’s Pause.
Before everyone comes at me with pitchforks and torches, let me say my piece before you make a martyr of me.
“Some of the most successful and popular Data Pipeline tools ever created, and still used today, are a marriage between visual GUI and code. This is even true, to some extent, for Airflow.”
I know we’ve barely scratched at the surface of mage, but you can see it can’t you? Putting easy-to-understand abstractions on top of all the familiar concepts of Data Pipelines? This is classic.
After all, mage is a Data Pipeline tool, and it’s extremely clear from what little we have seen of it, that this is exactly what mage is trying to make easier.
Closing Thoughts.
I’ve been long-winded already, let me try to give some more closing commentary on mage.ai if you please. There are a few things that come to mind.
They talked a lot about Validation at each Block and Data Product produced. As for validation, what they mean is that when you generate a new Block, it auto-generates a test function for you.
While this is helpful and leads in the right direction, it seems a little misleading. Most of us are already writing unit tests for our code.
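For reference, the auto-generated stub is, as best I recall, just a decorated assertion on the Block’s output that you are meant to flesh out yourself; treat the details as approximate:

```python
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@test
def test_output(output, *args) -> None:
    # mage runs this after the Block executes; add whatever checks matter to you.
    assert output is not None, 'The output is undefined'
```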
Also, after my problems trying to write a local PySpark pipeline, I’m a little suspicious of their integrations and how much of it has been tested in real life. Let’s be honest, the number of pure Python data pipelines is dropping by the day. Data gets bigger and more complex, which requires most pipelines to run on Redshift, Databricks, Snowflake, DBT cloud, and the like.
This is why Airflow is so popular, it connects to anything with ease.
But, there are some bright spots after trying out mage.
Mage is so much more than Airflow; it can do what Airflow does, but better and with more features.
Mage takes away a lot of the mundane parts of developing data pipelines.
Mage was able to put easy-to-understand and easy-to-use concepts on top of pipeline components.
Mage is the perfect marriage of GUI and code, that much is clear.
“I can say even after just trying it once, mage would help any Data Engineering team write uniform, clean, well tested Data Pipelines. This is NOT something found in Airflow, Prefect, or Dagster.”
If I were running a team of junior engineers, or just a very large team of data engineers writing all sorts of pipelines, and I were looking for uniformity and a tool that would solve many of the common problems, it would be Mage without a doubt.
The last time I saw a tool like this was SSIS, no joke. Does mage have a bright future? I have no idea. Probably, but who knows? I’m a fan. I’m going to watch them closely in the future to see where the tool goes.