Why is everyone trying to kill Airflow?
Apache Airflow has ruled Data Engineering orchestration for years. Is the end in sight?
The end is near, the end is near! It feels like someone is constantly waving one of those end-of-the-world signs and yelling at us to take cover, and the apocalypse in question is the demise of our beloved Apache Airflow.
Let’s peer into the cloudy and murky future, read the tea leaves, and gaze into the crystal ball. Is it really the end of the age of Apache Airflow? Dagster is here, Prefect is here, Databricks Workflows is here; everyone is trying to throw Airflow from its lofty throne.
Is there something(s) wrong with Airflow?
I’m going to start with the conclusion: Apache Airflow is far from dead. In fact, it probably has a bright future for a long time to come. I’m calling no-go on the doomsday proclaimers who are out there peddling the idea that the end of Airflow as we know it is near.
“Anyone who thinks Airflow’s days are numbered has been reading too much marketing material and isn’t paying attention to the reality in the wider Modern Data Stack.”
There is nothing fundamentally wrong with Apache Airflow; it’s a first-class Data Engineering tool that has been driving data pipelines forward for a long time. And the adoption of this tool is NOT slowing down.
Why then are people calling the end of Apache Airflow?
I’m going to make a proposition to you: the average Data Team or Data Engineer is not calling for the demise of Apache Airflow. It’s new SaaS companies and marketing teams calling for the end of Airflow, and there is a difference between the two.
“It isn’t really that Airflow is coming to an end, it’s that Airflow has reached critical mass and now competitors are nipping at its heels, taking market share.
The more people that use Airflow, the more the rocks below the surface start to show, because Airflow is being pushed to its limits.
And that’s a good thing.”
Airflow is taking heat because there are two newcomers to the field of pipeline orchestration, Prefect and Dagster. This is only natural; it’s how the free and open market operates, even with open-source software.
When something, like Airflow, becomes extremely popular because it’s doing a task so well and filling a need, there is always someone or something that is going to come along and say, “I can do better than that.”
There is nothing better for Airflow than stiff competition. It drives the community at large to improve and innovate.
How can we really know Airflow still has a solid future?
This is an important question; you don’t want to hitch your wagon to bum oxen if you’re in the market for a new orchestration and dependency management tool. I will give you two very compelling reasons to believe Airflow is here for the long haul.
MWAA (managed Airflow on AWS)
Cloud Composer (managed Airflow on GCP)
It's a pretty safe bet that if AWS and GCP are willing to go to all the trouble of offering Airflow as a managed service, it will be around for a while, if only because they have so many customers and so much of the market share.
When cloud companies make it ridiculously easy to use a tool, you can bet that Data Teams are going to take the bait. Simplicity and keeping tooling on the same architecture are seen as more and more important.
What it really boils down to is that if AWS and GCP are supporting Airflow, it’s here to stay.
What are some things Airflow is bad at?
It's important to recognize that every tool is going to have “areas that need improvement,” also known as areas that it sucks at.
Airflow is no exception to this rule; its shortcomings are what have given rise to the next generation of tools like Prefect and Dagster. Airflow has real shortcomings that can cause serious problems for some data teams.
Doesn't scale well with big data.
UI isn't exactly something to cheer about.
Some folks complain about the verbosity of DAGs.
Running hundreds of thousands of DAGs can be a pain.
Integrating custom pipeline code into Airflow isn't very smooth.
Inter-task communication is a thorn in the flesh (see the sketch after this list).
Some folks complain about the DAG learning curve.
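To make a couple of those complaints concrete, here is a minimal sketch of a classic two-task DAG that hands a value from one task to the next via XCom. The DAG name and callables are made up for illustration; the point is how much boilerplate even a trivial pipeline needs, and that XCom is the only built-in channel for inter-task communication.

```python
# Minimal sketch of a classic Airflow DAG (names are hypothetical).
# Illustrates the boilerplate and the XCom-based inter-task communication
# mentioned in the list above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # XCom is meant for small bits of metadata, not real datasets.
    context["ti"].xcom_push(key="row_count", value=42)


def load(**context):
    # Pull the value back out in the downstream task.
    row_count = context["ti"].xcom_pull(task_ids="extract", key="row_count")
    print(f"Upstream reported {row_count} rows")


with DAG(
    dag_id="example_verbose_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # dependencies declared with the bitshift operator
```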
To be honest, you can’t really blame all of these things on Apache Airflow. The truth is that Airflow has become so ubiquitous and widely used in the Data Engineering community that it gets abused.
Trying to use Airflow with a bunch of workers to actually do data processing and transformations at scale … is probably not a good idea. You’re misusing Airflow. What you should do instead, if you have data of any decent size, is offload the compute using the community-provided connectors/packages, leaving Airflow to do what it does best: monitor, orchestrate, manage dependencies, and schedule.
“Don’t make the mistake of thinking Airflow is a great data processing tool, in itself. That is not where it shines.”
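To sketch what that offloading looks like in practice, here is a rough example, assuming the Databricks provider package (apache-airflow-providers-databricks) is installed and a connection named databricks_default is configured; the DAG name, cluster spec, and notebook path are placeholders, not a recommendation of any particular stack. Airflow submits the job and monitors it, while Spark on Databricks does the actual heavy lifting.

```python
# Sketch of the "offload the compute" pattern (all names/paths are placeholders).
# Airflow only schedules, orchestrates, and monitors; Databricks runs the Spark job.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="offload_heavy_transform",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_transform = DatabricksSubmitRunOperator(
        task_id="run_spark_transform",
        databricks_conn_id="databricks_default",
        json={
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "notebook_task": {"notebook_path": "/pipelines/transform"},
        },
    )
```

The same pattern works with the Snowflake, EMR, BigQuery, and other provider operators: push the heavy compute to an engine built for it and keep Airflow in the orchestration seat.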
Sometimes the grass looks greener on the other side of the fence.
It’s hard not to get drawn in by the shiny new tools; we are all human, after all. I mean, when you’re looking at Airflow’s UI compared to its direct competitors … it would make you start to wonder.
What is Airflow still good at?
Well, I’m glad you asked! With the number of engineers using Airflow around the world, you can bet your bottom dollar that there are many spots where it excels and enables Data Teams to do amazing things.
Let’s sing the praises of Apache Airflow for a while, shall we?
Airflow is good at scheduling tasks.
Airflow is good at orchestration.
Airflow is good at dependency management.
Airflow has a massive and very active community.
Airflow has an amazing number of third-party providers (think Snowflake, Databricks, etc.).
Airflow is very customizable and extensible (a small example follows this list).
Airflow is supported and managed by AWS and GCP.
Airflow has been around long enough to be “hardened” for production.
I will stop there but you get the point.
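On the “customizable and extensible” point above: writing a custom operator is little more than subclassing BaseOperator and implementing execute(). The operator below is a toy example of my own, not anything from the Airflow docs, but it shows how small the surface area is.

```python
# Toy custom operator (hypothetical example) showing how extensible Airflow is:
# subclass BaseOperator, implement execute(), and you have a reusable building block.
from airflow.models.baseoperator import BaseOperator


class GreetOperator(BaseOperator):
    """Logs a greeting; stands in for whatever custom logic a team needs."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        self.log.info("Hello, %s", self.name)
        return self.name  # returned values are pushed to XCom automatically
```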
“If you’re looking for a basic, run-of-the-mill, bulletproof orchestration tool for your Data Engineering work, you would be foolish to overlook Airflow.”
Why is everyone trying to kill Airflow?
I think we found the answer to that. Not everyone is trying to kill Apache Airflow, or throw it under the bus, so to speak. It’s mostly just marketing crap, but that’s ok. Look on the bright side of it all: Airflow is potent enough to have brought to life products like Dagster and Prefect that want to take bites out of its heels.
Dagster and Prefect are wonderful tools; they are pushing the boundaries of what is possible in a data orchestration tool. They are taking data pipelines to the next level and redefining what is popular and how we all solve problems.
Surprised to not see a mention of Astronomer - they've bet big on Airflow and are contributors to the open source distro as well. They definitely support your argument here!
Cloudera is also betting big on Apache Airflow by integrating it well within its Data Engineering stack, taking away some of the pain points mentioned above with respect to scaling and verbosity. They are contributing to the open source community as well.