There was one comment out there on Reddit that suggested getting contracts implemented required playing an elaborate game of politics that blamed upstream teams for warehouse downtime in post-mortems. Good luck doing that most places without getting yourself thrown in the bin...
Awesome write-up. I agree the idea sounds good, and we, as data engineers, have been fighting with bad data for decades. We just called it schema change or evolution.
IMO, data quality tools integrated into orchestrators are the way. Especially if the orchestrator is data-asset-driven in a declarative way, meaning you can create assertions on top of data assets (your dbt tables, your data marts), not on data pipelines. That way, every time a data asset gets updated, you know the "contract" (the assertions) still holds.
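To make the idea concrete, here is a minimal sketch of asset-level assertions in plain Python. All names here are hypothetical, not a real orchestrator's API; the point is that checks attach to the asset itself, so they hold whenever the asset is rebuilt, regardless of which pipeline updated it.

```python
def check_not_null(rows, column):
    """Assert that no row has a NULL in the given column."""
    return all(row.get(column) is not None for row in rows)

def check_unique(rows, column):
    """Assert that values in the given column are unique."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

def validate_asset(rows, assertions):
    """Run every assertion against the asset; return the names that fail."""
    return [name for name, check in assertions.items() if not check(rows)]

# Example: a small "orders" data mart with its contract-style assertions.
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "b"},
]
assertions = {
    "order_id is unique": lambda rows: check_unique(rows, "order_id"),
    "customer_id is not null": lambda rows: check_not_null(rows, "customer_id"),
}
failures = validate_asset(orders, assertions)  # [] when the contract holds
```

In a real setup this is what tools like dbt tests or orchestrator-level asset checks do for you; the sketch just shows why asserting on the asset (the table) rather than the pipeline keeps the guarantee independent of which job produced it.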
I suspect a great way forward would be for a SaaS like Snowflake to introduce the concept of a data contract object (semantics, not just columns and data types); ideally there would be some ISO standard defining data contracts for most core business entities.
I would like to disagree:
- Accessing data always means using an interface or an API; without a data contract, you have neither consent nor certainty that the data and its quality will be preserved. An SQL interface is still an API.
- Data contracts provide certainty as to who in the organisation owns or manages the data or the data source. Data owners are very often not data engineers.
- Quality tools should be integrated into data contracts.
- Nobody is forcing you to use Avro or Protobuf.
- Nobody is preventing you from using Python and SQL.
Data contracts are a tool to make the use of data more robust between different teams. They provide a framework for discussion.
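The points above can be sketched as a single contract object: beyond columns and types, it names an owner and bundles quality checks, giving both teams one artifact to discuss. Everything here (the class, the fields, the example owner) is an illustrative assumption, not any vendor's actual contract format.

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    name: str
    owner: str                 # accountable team or person; often not a data engineer
    schema: dict               # column name -> type: the minimal baseline
    quality_checks: dict = field(default_factory=dict)  # check name -> callable

    def validate(self, rows):
        """Return the names of quality checks that fail for the given rows."""
        return [name for name, check in self.quality_checks.items()
                if not check(rows)]

# Hypothetical example: the "orders" entity, owned by a billing team.
orders_contract = DataContract(
    name="orders",
    owner="billing-team",      # assumed example owner, not from the thread
    schema={"order_id": "int", "amount": "decimal"},
    quality_checks={
        "amounts are non-negative":
            lambda rows: all(r["amount"] >= 0 for r in rows),
    },
)
```

Nothing in this shape forces Avro or Protobuf; the same object could be serialised to YAML, enforced in SQL, or checked in Python, which is the "framework for discussion" part.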