Have you ever been in a situation where you were assigned a data issue to fix, but you did not know:

  1. The source of the data in question
  2. The owner of the data
  3. Whom to reach out to for escalation
  4. How often the data is refreshed
  5. The data quality checks that have already been applied


If you have been in such a situation, then I can confirm that your organization has not set up any data contracts with its data producers.


What is a Data Contract?

A data contract is a set of terms agreed upon by data producers and downstream data consumers. It defines the structure, format, cadence, service level agreement (SLA), and quality expectations of the data. The document should be able to answer every "what" about the data, along with providing concrete steps for events such as a failed delivery or poor-quality data.


Problems solved by Data Contracts

  1. Producers making uncommunicated changes that break downstream reports
  2. Erosion of customer trust due to inconsistent or unreliable data
  3. Ambiguous data ownership and accountability
  4. Lack of column-level business context and definitions
  5. Silent data quality issues that go undetected
  6. No clearly defined validation rules at the column level
  7. Unclear SLAs and undefined escalation steps during data issues or outages
  8. Unclear access controls and permissions
  9. Increasing regulatory and compliance demands


Key components of a Data Contract

Schema Definition: Column names, data types, nullability, and key constraints that the producer commits to maintain.


Metadata: Identifying details such as the dataset name, owner, and contract version.


Data Quality Rules: Column-level validation rules (for example, uniqueness or completeness) with thresholds that trigger alerts when breached.


Service Level Agreement: Availability, refresh cadence, incident response time, and support hours.


Business Context: Plain-language definitions of each field and how the business interprets them.


Lineage and Dependency: Where the data originates and which downstream assets depend on it.


Access and Security: Data classification, permitted consumers, retention period, and encryption requirements.
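
These components can be captured in a single machine-readable document. Below is a minimal sketch of that shape as a Python dict; the section and field names are illustrative assumptions, not a standard:

contract = {
    "metadata": {"name": "...", "dataset": "...", "owner": "...", "version": "1.0"},
    "schema": {"column_name": {"type": "...", "required": True, "pii": False}},
    "quality_rules": [{"name": "...", "type": "uniqueness", "column": "...", "threshold": 1.0}],
    "sla": {"cadence": "hourly", "availability": "99%", "incident_response": "within 4 hours"},
    "business_context": {"column_name": "what this field means to the business"},
    "lineage": {"upstream": ["source_system"], "downstream": ["reports", "models"]},
    "access": {"classification": "confidential", "allowed_consumers": ["..."], "retention": "..."},
}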


Implementation Approaches

Approach 1:

  1. Identify critical data pipelines
  2. Define all components of the Data Contract
  3. Review with Stakeholders and Producer
  4. Build pipeline
  5. Validate

Approach 2:

  1. Define the standard version of a Contract with Stakeholders
  2. Collaborate with Producer to finalize components such as SLA, Data quality, etc.
  3. Build pipeline
  4. Validate


Sample Data Contract

Suppose you are on a data engineering team that will be ingesting BILLING data from an upstream source, and you need to set up an ingestion pipeline to make this data available in the data warehouse for data scientists and data analysts.


Contract {
  metadata {
    name: Billing
    dataset: Fact_billed_charges
    owner: Finance
    version: 1.0
  }
  schema {
    Bill_id {
      type: BigInt
      description: Unique identifier of the bill
      required: true
      pii: false
    }
    Customer {
      type: Varchar
      description: Customer name for whom the bill has been generated
      required: true
      pii: true
    }
    Bill_date {
      type: Date
      description: Date on which the bill was generated
      required: true
      pii: false
    }
    Bill_amount {
      type: Int
      description: Amount of the bill without tax
      required: true
      pii: true
    }
  }
  rules {
    Rule_1 {
      name: Bill_id_uniqueness
      type: Uniqueness
      column: Bill_id
      threshold: 100%
    }
    Rule_2 {
      name: Customer_completeness
      type: Completeness
      column: Customer
      threshold: 100%
    }
  }
  sla {
    availability: 99%
    cadence: hourly
    incident_response: within 4 hours
    support_hours: 24 hours
  }
  access {
    classification: confidential
    retention: 5 years
    allowed_consumers: Data Engineering
    encryption: AES-256 at rest
  }
}
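
Below is a minimal sketch of how the rules section of this contract could be enforced in the ingestion pipeline, assuming the data lands in a pandas DataFrame. The sample rows and helper names are made up for illustration:

import pandas as pd

# Illustrative snapshot of Fact_billed_charges; the values are made up.
df = pd.DataFrame({
    "Bill_id": [101, 102, 103],
    "Customer": ["Acme Corp", "Globex", None],
    "Bill_date": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-02"]),
    "Bill_amount": [250, 400, 175],
})

def check_uniqueness(frame: pd.DataFrame, column: str, threshold: float) -> bool:
    # Share of rows in `column` not flagged as duplicates.
    return (1 - frame[column].duplicated().mean()) >= threshold

def check_completeness(frame: pd.DataFrame, column: str, threshold: float) -> bool:
    # Share of non-null rows in `column`.
    return frame[column].notna().mean() >= threshold

# Thresholds of 100% come straight from the sample contract above.
results = {
    "Bill_id_uniqueness": check_uniqueness(df, "Bill_id", 1.0),
    "Customer_completeness": check_completeness(df, "Customer", 1.0),
}

for rule, passed in results.items():
    print(f"{rule}: {'PASS' if passed else 'FAIL - alert producer and consumer'}")

Running this on the made-up rows would flag Customer_completeness, since one row is missing the customer name; exactly the kind of silent issue a contract is meant to surface.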
    




Best Practices

  1. Data contracts for critical data: It is not feasible to put every dataset under a contract, so start with the high-severity data pipelines that feed key KPIs for your customers. Many data contract initiatives fail because the effort goes into bringing every existing pipeline under contract at once. Data stewards, along with stakeholders, should apply a simple rule of thumb: add a data contract when the cost of failure exceeds the cost of implementation.
  2. Collaborate with stakeholders early: Data contracts require consumers, producers, and stakeholders to reach common ground, and this should happen as early as possible in the cycle. Nobody needs a contract that no one follows.
  3. Data contract automation: Treat contracts like code: store them in a Git repository, make changes through pull requests, and ensure every change goes through code review. Managing contracts manually is nearly impossible, whereas machine-readable contracts can be used to set up automated rule checks on the data and alert when a rule is breached (see the sketch after this list).
  4. Establish version control: Maintain historical versions of contracts, and update them as requirements change, such as schema evolution or new business rules. Tracking this trail of changes improves customer trust, since every change is associated with a requirement, and makes contracts bulletproof.
  5. Clear monitoring and ownership: Every contract should have a dedicated owning team representing the producer, consumer, and stakeholders. The consumer team can serve as the first point of contact, but it is critical to alert the producer as well when the contract is breached. Another useful monitoring practice is a monthly report per contract, showing whether everything is green or whether recurring breaches need to be addressed at a wider level.
  6. Periodic reviews: Strong data teams set up periodic reviews of all contracts. It is easy to create contracts once and never assess them again. For data contracts to succeed, it is critical to meet with stakeholders and producers regularly to maintain the contracts' relevance and effectiveness.
  7. Do not confuse a data contract with a schema registry: A schema registry for an ingestion pipeline covers only the schema definition: column data types, primary keys, and so on. It is one part of a data contract, not a data contract itself. A data contract should be designed to answer all questions about the ingested data, not just the schema.
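
As a sketch of the contracts-as-code idea in point 3, a check like the one below could run in CI on every pull request that edits a contract. It assumes contracts are stored as JSON files; the paths, file layout, and load_contract helper are hypothetical:

import json
import sys

def load_contract(path: str) -> dict:
    # Hypothetical loader; assumes the contract is serialized as JSON.
    with open(path) as fh:
        return json.load(fh)

def breaking_changes(old: dict, new: dict) -> list:
    # A change is "breaking" if a column consumers rely on is removed or retyped.
    issues = []
    for column, spec in old["schema"].items():
        if column not in new["schema"]:
            issues.append(f"column removed: {column}")
        elif new["schema"][column].get("type") != spec.get("type"):
            issues.append(f"type changed: {column}")
    return issues

if __name__ == "__main__":
    # Hypothetical paths: the contract as merged vs. the proposed revision.
    issues = breaking_changes(load_contract("contracts/billing_v1.json"),
                              load_contract("contracts/billing_v2.json"))
    if issues:
        print("\n".join(issues))
        sys.exit(1)  # block the merge until producer and consumers agree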



Challenges and Pitfalls

  1. Data contracts not prioritized: One of the biggest challenges data teams face is that their org does not prioritize the creation of data contracts, seeing it as a step that pushes final pipeline delivery further down the road. Everyone wants the data available yesterday, and introducing another process that delays pipeline creation will meet resistance. It is therefore essential that org leaders are aligned on the concept of data contracts and why they matter if the org is to evolve and reduce the issues caused by producer changes that arrive out of the blue.
  2. Incorrect scope: A data contract that is too lenient lets issues fall through the cracks, while an overloaded one slows development and keeps on-call occupied with trivial issues. The onus of capturing the essential rules falls on the data team, who must ensure that only business-critical columns sit under strict rules and that unnecessary rules are avoided.
  3. Bringing involved parties onto the same page: If convincing the leaders of the data analytics org was not hard enough, a challenge on which the success of data contracts hinges is getting producers to comply with the rules the contracts impose. Quite often, the producer team pushes the removal of erroneous records onto the consumer, as they see no value in spending time fixing data issues at the source; this practice slowly kills data contracts, since every issue then gets filtered out or resolved through manual intervention. Such scenarios make common consensus essential.
  4. No training and documentation: This challenge cannot be overstated; if any involved party lacks a proper understanding of data contracts, the whole standardization process will fail. It is mission-critical that engineers learn about data contracts from day one, and documentation should clearly state that data contracts exist not to slow progress down but to enable faster delivery by catching issues before they reach the end customer.
  5. Data is not static and changes quickly: Business moves fast; it has to keep up with customer needs, and so do our data contracts. New fields are added and old fields are deprecated at a pace that is hard to imagine, and without data contracts the source team would keep making those changes without informing downstream teams or running an impact analysis. Data contracts force these teams to communicate and collaborate with stakeholders to avoid churn.



Can AI Agents thrive without data contracts?

The world is moving toward, or rather has already moved to, using AI agents for numerous tasks. Granting agents that autonomy makes them highly reliant on source data. If we need AI agents to take over critical tasks completely, then we need to give them strict guardrails that guarantee high-quality data and no silent data issues, which would otherwise trickle down to customers.


A typical process owned by an AI agent will involve consuming data from numerous sources and producing the final output for customers.


Consider a situation where the producer has changed the business context of certain fields, but because no data contracts were established, the change was never communicated. The AI agent keeps delivering critical data for a major KPI, and no one finds the error until it pollutes other dashboards and reports. An issue with the business definition of a single column, one that could have been fixed if alerted on time, spreads into customer escalations, churn for the data and software development teams, and misaligned goals.
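
This failure mode is exactly what a contract-driven gate in front of the agent is meant to stop. A hypothetical sketch, assuming the agent's input arrives as a pandas DataFrame and reusing the rule style of the earlier sample contract:

import pandas as pd

def passes_contract_checks(frame: pd.DataFrame) -> bool:
    # Illustrative stand-ins for the contract's rules: Bill_id must be
    # unique and Customer must be complete before the agent may act.
    return bool(frame["Bill_id"].is_unique and frame["Customer"].notna().all())

def agent_step(frame: pd.DataFrame) -> str:
    # Without a contract, the agent consumes whatever arrives; with one,
    # a breach halts the pipeline instead of polluting downstream reports.
    if not passes_contract_checks(frame):
        return "halt: contract breach, alert producer and consumer"
    return "proceed: generate KPI output"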


The seemingly simple reporting error explodes when you think of it in terms of healthcare or manufacturing, where a single data field can completely change the narrative for the end customer.


As AI moves from experimentation to production systems, data contracts are required for maintaining-