There was one comment out there on Reddit suggesting that getting contracts implemented requires playing an elaborate game of politics: blaming upstream teams for warehouse downtime in post-mortems. Good luck doing that in most places without getting yourself thrown in the bin...
haha
Awesome write-up. I agree the idea sounds good, and we, as data engineers, have been fighting with bad data for decades. We just called it schema change or evolution.
IMO, data quality tools integrated into orchestrators are the way to go, especially if the orchestrator is data-asset-driven in a declarative way. Meaning you can create assertions on top of data assets (your dbt tables, your data marts), not on data pipelines. So every time a data asset gets updated, you are certain the "contract" (the assertions) holds.
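As a minimal sketch of that idea, here is what an asset-level assertion could look like with Dagster's asset checks (one declarative, asset-driven orchestrator); the asset, columns, and data are illustrative, and a real asset would read from the warehouse rather than build a frame inline:

```python
import pandas as pd
from dagster import AssetCheckResult, asset, asset_check

@asset
def dim_customers() -> pd.DataFrame:
    # Stand-in for the real materialization (e.g. a dbt model or a load job).
    return pd.DataFrame({"customer_id": [1, 2, 3],
                         "email": ["a@x.com", "b@x.com", "c@x.com"]})

@asset_check(asset=dim_customers)
def customer_id_is_unique_and_not_null(dim_customers: pd.DataFrame) -> AssetCheckResult:
    # The "contract": assertions re-evaluated every time the asset is materialized.
    ids = dim_customers["customer_id"]
    return AssetCheckResult(passed=bool(ids.notna().all() and ids.is_unique))
```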
agreed
I suspect a great way forward would be for a SaaS like Snowflake to create the concept of a data contract object (semantics, more than just columns and data types). Ideally there would be some ISO standard defining data contracts for the most common core business entities.
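To make the idea concrete, here is a purely hypothetical sketch of what such a contract object might carry, written as plain Python (nothing here is a real Snowflake or Databricks feature; every field is an assumption about what "semantics beyond columns and types" could mean):

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    # Hypothetical contract object: all fields are illustrative.
    entity: str                    # core business entity, e.g. "customer"
    owner: str                     # accountable team or person
    schema: dict[str, str]         # column name -> data type
    semantics: dict[str, str]      # column name -> business meaning
    quality_checks: list[str] = field(default_factory=list)
    sla_freshness_hours: int = 24  # how stale the data may be

contract = DataContract(
    entity="customer",
    owner="crm-team",
    schema={"customer_id": "NUMBER", "email": "VARCHAR"},
    semantics={"customer_id": "stable surrogate key, never reused"},
    quality_checks=["customer_id IS NOT NULL", "COUNT(*) > 0"],
)
```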
For sure, if someone like Databricks or Snowflake came up with an implementation, people would probably use it.
I would like to disagree:
- Accessing data always means using an interface or an API; without a data contract there is no consent and no certainty that the data and its quality will be preserved. An SQL interface is still an API.
- Data contracts provide certainty as to who in the organisation owns or manages the data or the data source. Owners of data are very often not data engineers.
- Quality tools should be integrated into data contracts.
- Nobody is forcing you to use Avro or Protobuf.
- Nobody is preventing you from using Python and SQL.
Data contracts are a tool to make the use of data more robust between different teams. They provide a framework for discussion.
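To illustrate two of those points (the owner living in the contract, and quality checks integrated into it), here is a minimal, framework-free Python sketch; the dataset, owner, and checks are all invented for the example:

```python
import pandas as pd

# Illustrative contract: every name and field here is made up for the sketch.
CONTRACT = {
    "dataset": "orders",
    "owner": "checkout-team",  # accountable owner; often not a data engineer
    "schema": {"order_id": "int64", "amount": "float64"},
    "checks": [
        lambda df: df["order_id"].notna().all(),
        lambda df: (df["amount"] >= 0).all(),
    ],
}

def enforce(contract: dict, df: pd.DataFrame) -> None:
    # Schema part of the contract: required columns with expected dtypes.
    for col, dtype in contract["schema"].items():
        assert str(df[col].dtype) == dtype, f"{col}: expected {dtype}, got {df[col].dtype}"
    # Quality part: checks travel with the contract, not with the pipeline.
    for check in contract["checks"]:
        assert check(df), f"check failed on {contract['dataset']} (owner: {contract['owner']})"

enforce(CONTRACT, pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 0.0]}))
```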
I also disagree that the data engineering community has not decided yet. I think the community is starting to shift, and adoption is steadily growing.
Data contracts enable encapsulation and unit testing. The data engineering community is used to stitching scripts together into spaghetti pipelines. That is fine at small scale, but as data teams get bigger and build more pipelines, the need for proper software engineering practices grows.
Encapsulation requires a proper API mechanism for data, and data contracts are a valid way to express an API that exposes tabular data via SQL. In this sense, data contracts separate the internals from the API.
And contract enforcement corresponds to unit testing in software engineering. Remember the days when release cycles kept getting longer because we were afraid to release? Data engineering hits that same wall once the scale of analytical data grows beyond what we can handle with a few scripts.
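As a sketch of that correspondence, a contract check can be written exactly like a unit test; this assumes pytest and an in-memory DuckDB purely to keep the example self-contained, and the table and assertions are invented:

```python
import duckdb
import pytest

@pytest.fixture
def con():
    # In-memory stand-in for the real warehouse.
    con = duckdb.connect()
    con.execute("CREATE TABLE orders AS SELECT 1 AS order_id, 9.99 AS amount")
    return con

def test_orders_contract(con):
    df = con.execute("SELECT order_id, amount FROM orders").df()
    # The "contract", enforced in CI like any other unit test.
    assert len(df) > 0
    assert df["order_id"].notna().all()
    assert (df["amount"] >= 0).all()
```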
I think it's early days, and a lot of data engineering teams haven't yet reached the scale where they really see the need for applying these software engineering principles. But once you have seen them, you cannot unsee them. To use the software engineering analogy: even when I write a small script now, I still start with a unit test and I still consider the interface carefully. I think it will be the same with data engineering. More and more teams will reach the scale where these software engineering practices are needed, and IMO data contracts will become mainstream in the next few years.
For these reasons, we have started to build a data-contracts-native solution at Soda.
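For a flavor of what check enforcement looks like there today, here is a minimal sketch using soda-core's programmatic scan API; it assumes a data source named "warehouse" is already configured in configuration.yml, the table and thresholds are illustrative, and it shows plain SodaCL checks rather than any contracts product specifically:

```python
from soda.scan import Scan  # pip install soda-core-<your-warehouse>

scan = Scan()
scan.set_data_source_name("warehouse")                  # assumed data source name
scan.add_configuration_yaml_file("configuration.yml")   # connection config (assumed path)
scan.add_sodacl_yaml_str("""
checks for dim_customers:
  - row_count > 0
  - missing_count(customer_id) = 0
  - duplicate_count(customer_id) = 0
""")
scan.execute()
scan.assert_no_checks_fail()  # raises if any check (the "contract") is broken
```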