A revenue dashboard drops 18% overnight. The pipeline is ‘green.’ The lineage graph looks right. Query history shows the job ran successfully. Yet you still can’t answer the only question leadership cares about: what changed—and can we prove it?
Traditional lineage is built for discovery: it shows what depends on what. Incidents demand evidence: what exactly ran, on which versions, with which logic and checks, and what blast radius that run created. Graphs show paths; incidents require proof.
This article proposes a practical, vendor-neutral standard you can implement with tools you already have: Minimum Incident Lineage (MIL). MIL is not a lineage UI. It’s a run-level evidence schema—the smallest set of fields that makes incidents replayable, auditable, and fast to triage, without storing raw data.
Why ‘lineage’ isn’t enough during incidents
During an incident, these questions matter more than dependency paths:
- Which exact upstream versions were used in the bad run?
- Was the transformation logic different from the last known good run?
- Did a data quality check warn us, and did we publish anyway?
- Did the warehouse execute a different plan despite the same SQL?
- Who owns this asset, and what is the blast radius?
MIL targets incident questions, not just dependency questions.
MIL in one sentence
Minimum Incident Lineage (MIL) is the minimal run-level evidence you must capture for every published dataset so you can reproduce, triage, and audit a data incident—without storing raw data.
MIL design principles
- Replayable: evidence reconstructs input versions → transform → output version.
- Minimal: if it’s heavy, teams won’t emit it consistently.
- Safe by default: store proof, not payload (hashes/IDs/buckets).
The MIL schema: the minimum 12 fields
A) Run identity and timing
- mil_run_id—globally unique run identifier (orchestrator run + task + attempt)
- timestamp_start—run start time (UTC)
- timestamp_end—run end time (UTC)
- asset_id—stable catalog identifier (not just schema.table)
B) Input/output version evidence
- input_asset_versions[]—upstream (asset_id, version) pairs (snapshots/commits)
- output_asset_version—immutable version produced
- schema_fingerprint—hash of output schema (cols + types + order)
C) Transformation and execution evidence
- transform_fingerprint—hash of logic (normalized SQL/dbt hash/Spark code hash)
- execution_fingerprint—plan/config hash (warehouse plan hash, Spark physical plan hash, key params)
D) Quality, governance, and safety gates
- dq_gate_status—PASS | WARN | FAIL + dq_ruleset_version
- policy_tags_applied—tags at publish time (classification/masking/retention)
E) Ownership and impact
- owner_ref—team/on-call reference
- blast_radius—dependents count + tier/severity bucket
If you want a strict ‘12,’ make blast_radius the 12th field and enforce owner_ref via your catalog. In practice, most teams keep both because they remove the two biggest sources of incident latency: “Who owns this?” and “Who is impacted?”
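To pin down the nested shapes, here is the record as a typed sketch in Python. The encodings for version pairs, gate status, and blast radius are one reasonable choice, not part of the standard.

# Typed sketch of a MIL record. Field names follow the schema above;
# the nested shapes are illustrative, not mandated.
from typing import Literal, TypedDict

class AssetVersion(TypedDict):
    asset_id: str          # stable catalog identifier
    version: str           # immutable snapshot/commit reference

class DqGateStatus(TypedDict):
    status: Literal["PASS", "WARN", "FAIL"]
    ruleset_version: str

class BlastRadius(TypedDict):
    dependents_count: int
    tier: str              # severity bucket, e.g. "SEV2"

class MilEvent(TypedDict):
    mil_run_id: str
    asset_id: str
    timestamp_start: str   # ISO-8601 UTC
    timestamp_end: str
    input_asset_versions: list[AssetVersion]
    output_asset_version: str
    schema_fingerprint: str
    transform_fingerprint: str
    execution_fingerprint: str
    dq_gate_status: DqGateStatus
    policy_tags_applied: list[str]
    owner_ref: str
    blast_radius: BlastRadius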
A safe MIL event example (no raw data)
{
  "mil_run_id": "airflow:dag=rev_mart,run=2026-01-25T09:00Z,task=build,try=1",
  "asset_id": "catalog:dataset:rev_mart_v2",
  "timestamp_start": "2026-01-25T09:00:03Z",
  "timestamp_end": "2026-01-25T09:07:41Z",
  "input_asset_versions": [
    {"asset_id": "catalog:dataset:orders", "version": "iceberg:snap=8841201"},
    {"asset_id": "catalog:dataset:customers", "version": "iceberg:snap=220993"}
  ],
  "output_asset_version": "iceberg:snap=9912103",
  "schema_fingerprint": "sha256:9b3c…",
  "transform_fingerprint": "git:dbt_model_hash=3f2a…",
  "execution_fingerprint": "warehouse:plan_hash=aa81…",
  "dq_gate_status": {"status": "WARN", "ruleset_version": "dq:v7"},
  "policy_tags_applied": ["pii:none", "masking:standard"],
  "owner_ref": "oncall:data-platform:rev-marts",
  "blast_radius": {"dependents_count": 37, "tier": "SEV2"},
  "publish_action": "PUBLISHED",
  "change_context": "pr:github:org/repo#1842"
}
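An event like this is only useful if it is complete. A small gate at ingestion keeps emitters honest; a minimal sketch, assuming events arrive as parsed dicts (publish_action and change_context stay optional):

# Completeness check for incoming MIL events. REQUIRED mirrors the core
# schema above; emitters that skip fields fail fast instead of silently.
REQUIRED = {
    "mil_run_id", "asset_id", "timestamp_start", "timestamp_end",
    "input_asset_versions", "output_asset_version", "schema_fingerprint",
    "transform_fingerprint", "execution_fingerprint", "dq_gate_status",
    "policy_tags_applied", "owner_ref", "blast_radius",
}

def validate_mil_event(event: dict) -> None:
    missing = REQUIRED - event.keys()
    if missing:
        raise ValueError(f"MIL event incomplete: missing {sorted(missing)}")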
Where MIL fields come from (implementation blueprint)
You can source MIL from systems you already run:
- Orchestrator (Airflow/Dagster/Prefect): mil_run_id, timestamps, attempt metadata
- Storage layer (Iceberg/Delta/Hudi): input_asset_versions[], output_asset_version
- Transform system (dbt/Spark/SQL repo): transform_fingerprint (git/compiled hash)
- Warehouse logs (Snowflake/Redshift/BigQuery/Databricks SQL): execution_fingerprint (plan hash/ID)
- DQ system (Great Expectations/Deequ/custom): dq_gate_status + ruleset_version
- Catalog/governance (Glue/Unity Catalog/DataHub/Collibra): asset_id, policy_tags_applied, owner_ref
- Lineage graph (optional): blast_radius (dependents count/tier), computed daily or on change (a computation sketch follows this list)
The key is consistency: emit MIL on every publish, including ‘small’ datasets.
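Most of these fields are lookups at publish time; blast_radius is the one that usually needs computation. A minimal sketch over an in-memory edge list, assuming your lineage graph can export direct dependents (the tier thresholds here are illustrative):

# Compute blast_radius by walking the dependency graph breadth-first.
# dependents_of maps asset -> direct downstream assets (from a lineage
# export); the tier thresholds are assumptions, not part of MIL.
from collections import deque

def blast_radius(asset_id: str, dependents_of: dict[str, list[str]]) -> dict:
    seen: set[str] = set()
    queue = deque([asset_id])
    while queue:
        for child in dependents_of.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    count = len(seen)
    tier = "SEV1" if count > 100 else "SEV2" if count > 10 else "SEV3"
    return {"dependents_count": count, "tier": tier}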
Walkthrough: ‘Revenue dropped overnight’ solved using MIL
Pipelines are green, but the dashboard is down 18% after the 9 AM refresh.
- Query MIL for rev_mart_v2 and pull the latest mil_run_id (a query-and-diff sketch follows this list).
- Compare transform_fingerprint to last known good: logic change vs not.
- Check schema_fingerprint for silent drift that alters joins.
- Compare input_asset_versions[] to isolate the upstream change quickly.
- Check dq_gate_status (and publish_action, if present) for enforcement gaps.
- Use blast_radius to set severity and notify impacted teams.
- If snapshots exist, roll back to the prior output_asset_version with confidence.
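The first four steps reduce to one query and a field diff. A sketch, assuming MIL events land in a mil_events table with the record stored as JSON text; table and column names are assumptions, and the previous publish stands in for ‘last known good’:

# Fetch the two most recent MIL events for the asset, parse each JSON
# record into a dict, then diff the evidence fields. Real triage may
# reach further back than the previous publish.
LATEST_TWO = """
    SELECT event                 -- full MIL record stored as JSON text
    FROM mil_events
    WHERE asset_id = 'catalog:dataset:rev_mart_v2'
    ORDER BY timestamp_end DESC
    LIMIT 2
"""

def diff_evidence(bad: dict, good: dict) -> list[str]:
    findings = []
    for field in ("transform_fingerprint", "schema_fingerprint",
                  "execution_fingerprint"):
        if bad[field] != good[field]:
            findings.append(f"{field} changed since last known good")
    # Isolate which upstream version moved between the two runs
    good_inputs = {v["asset_id"]: v["version"]
                   for v in good["input_asset_versions"]}
    for v in bad["input_asset_versions"]:
        if good_inputs.get(v["asset_id"]) != v["version"]:
            findings.append(f"upstream moved: {v['asset_id']} -> {v['version']}")
    if bad["dq_gate_status"]["status"] != "PASS":
        findings.append(f"gate said {bad['dq_gate_status']['status']}, published anyway")
    return findings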
Failure modes MIL catches that dependency lineage often misses
- Silent schema drift that didn’t fail the job but changed joins (schema_fingerprint)
- Upstream snapshot changes that quietly affected metrics (input_asset_versions[])
- ‘Same SQL, different plan’ regressions (execution_fingerprint)
- Gates that warned but didn’t block publishing (dq_gate_status + publish_action)
- Unknown ownership and slow comms (owner_ref + blast_radius)
What MIL is (and isn’t)
MIL is an evidence trail for publishers, not a replacement for your lineage UI. You can still draw dependency graphs, but MIL gives each node a verifiable run card you can inspect during triage.
MIL is also not ‘logging everything for observability.’ It’s the minimum that lets you answer incident questions quickly and defend your conclusions in a postmortem.
Where MIL lives
Most teams store MIL events in a small append-only store that’s easy to query during incidents:
- A warehouse table partitioned by date (fast SQL during on-call)
- A log topic (Kafka/Kinesis) with a curated sink
- A catalog metadata store (if it supports event records)
The only hard requirement: MIL records must be immutable and queryable by asset_id and time.
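For the warehouse-table option, setup stays small. A dialect-generic sketch (names, types, and the partition clause vary by warehouse and are assumptions); grant INSERT and SELECT only, never UPDATE or DELETE, so records stay immutable:

# One-time DDL for the warehouse-table option, held as a string.
# The partition clause and types vary by warehouse; adjust per dialect.
MIL_EVENTS_DDL = """
    CREATE TABLE IF NOT EXISTS mil_events (
        mil_run_id       STRING,
        asset_id         STRING,
        timestamp_start  TIMESTAMP,
        timestamp_end    TIMESTAMP,
        event            STRING,   -- full MIL record as JSON text
        event_date       DATE      -- partition key for fast on-call SQL
    )
    PARTITION BY (event_date)
"""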
Implementation tips that keep MIL minimal
- Emit MIL at publish time (or immediately after), not hours later.
- Normalize SQL before hashing (strip whitespace/comments, canonicalize identifiers) so fingerprints are stable; a sketch follows this list.
- Treat versions as first-class: if a system can’t produce immutable versions, create your own (e.g., content hash + timestamp).
- Avoid sensitive content: store rule IDs and buckets (rowcount: 1M–10M), not samples.
- Start with Tier-1 assets, then expand. Consistency beats completeness.
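Here is the fingerprinting sketch from the second tip, using regex-level normalization. A production version would canonicalize identifiers with a real SQL parser; the lowercasing below is naive and also folds string literals.

# Normalize SQL before hashing so whitespace/comment churn does not
# change the fingerprint; then hash the schema as cols + types + order.
import hashlib
import re

def normalize_sql(sql: str) -> str:
    sql = re.sub(r"--[^\n]*", " ", sql)               # strip line comments
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.S)  # strip block comments
    sql = re.sub(r"\s+", " ", sql).strip()            # collapse whitespace
    return sql.lower()       # naive case folding (also hits string literals)

def transform_fingerprint(sql: str) -> str:
    return "sha256:" + hashlib.sha256(normalize_sql(sql).encode()).hexdigest()

def schema_fingerprint(columns: list[tuple[str, str]]) -> str:
    # Declared order matters: reordered columns can silently change joins.
    canon = ",".join(f"{name}:{dtype}" for name, dtype in columns)
    return "sha256:" + hashlib.sha256(canon.encode()).hexdigest()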
MIL readiness checklist
- Tier-1 datasets emit MIL for every publish
- Output versions are immutable and addressable (snapshot/commit)
- Transform fingerprints are tied to source control and stable
- DQ gates output machine-readable status + version
- Ownership maps to an on-call rotation (not a person)
- Blast radius is computed at least daily (or on change)
- MIL avoids raw payloads; hashes/buckets only
- Retention matches your review window (e.g., 90–180 days)
Conclusion: MIL is ‘lineage as evidence’
Lineage helps you navigate systems. MIL helps you prove what happened. If you can answer ‘what changed?’ with evidence in under five minutes, you’ve built operational trust—not just lineage.