Everyone said their catalog was enterprise-ready. I had twelve governance questions and a deadline. Here's what survived.

It wasn't a planned experiment. It started with an uncomfortable meeting.

A VP of Finance asked our data team a simple question: "Where does this revenue number come from?"

We knew the answer, roughly. It came from a pipeline. The pipeline reads from a source system. The source system was owned by a team in another timezone. The logic had been updated three times in the past year. The last person who fully understood it had left the company eight months ago.

We had a data catalog. We opened it in the meeting. It showed the table name, the column names, and a description that read: "Revenue data. Updated daily."

The VP looked at us. We looked at the catalog. The catalog looked back blankly.

That was the moment I decided to actually evaluate what we had and what we were missing.

What I Was Testing For

Most data catalog comparisons focus on features: does it have a lineage graph, does it integrate with dbt, can you search by column name? These matter. But they're not the real question.

The real question is: when governance actually fails, when a number is wrong, when data is misused, when an audit happens, does your catalog help you respond? Or does it make things worse by giving you false confidence?

I identified twelve scenarios from real situations I'd either experienced or seen cause incidents. A few of the ones that revealed the most:

- A column is renamed upstream. Who gets notified downstream?
- A dataset contains PII. Can I find all tables with PII in under 5 minutes?
- A business metric's definition changed three months ago. Can I find the history?
- Two tables claim to be the source of truth for the same concept. Which one is right?
- A pipeline failed and produced nulls for 6 hours. Which downstream assets were affected?

I ran each scenario across five catalogs. I'm not using product names because this isn't a vendor shootout. I'll call them by the archetype they represent.

The Catalogs

The Librarian: mature, comprehensive, built by a team that clearly thought hard about metadata modeling. Feels like it was designed by people who read data management textbooks.

The Graph: lineage-first. Beautiful visualization of how data flows. Everything is a node and an edge. Impressive in demos.

The Modern Stack Native: built to integrate tightly with dbt, Airflow, and the modern data stack. If your stack is homogeneous and current, it fits like a glove.

The Enterprise Fortress: the one your procurement team will recognize. Built for compliance. Audit logs everywhere. UI from 2016.

The Lightweight Newcomer: fast to set up, clean UI, opinionated defaults. Clearly built by people who were frustrated with the other four.

What Actually Happened

Find All PII in Under 5 Minutes

I timed myself. Five minutes, starting from the catalog's search interface, to identify every table containing personally identifiable information.

This should be a solved problem. It is not.

The Enterprise Fortress won this one, not because the UX was good, but because it had mature classification and tagging features that, once configured, made the query trivial. The catch: "once configured" represented roughly two weeks of initial setup and ongoing curation work. The tags were only as good as whoever was maintaining them. When I searched for PII tables, I found 34. When I manually checked a sample, I found 3 that contained email addresses with no PII tag at all.

The Librarian had similar tagging, similar caveats, and a similar coverage gap.

The others relied on either ML-based auto-classification, which had a 20 to 30 percent miss rate on non-obvious column names like uid or handle, or manual tagging that simply hadn't been done.

The lesson wasn't about the catalogs. A catalog can only surface what humans have told it. If your data classification effort is incomplete, and it almost always is, your catalog will give you false confidence that you've found everything.
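The gap between name-based and value-based detection is easy to sketch. This is a toy illustration, not any catalog's actual classifier; the hint words and regexes are my own, and they miss exactly the kind of non-obvious columns that tripped up the ML-based tools above unless a value-level scan backs them up:

```python
import re

# Name-based heuristic: catches email_addr, phone_number, date_of_birth...
NAME_HINTS = re.compile(r"(email|phone|ssn|dob|birth|address|name)", re.I)
# Value-based heuristic: catches email addresses hiding in misnamed columns
EMAIL_VALUE = re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.I)

def flag_pii_columns(schema, sample_rows):
    """schema: {column_name: type}; sample_rows: list of dicts of sampled values."""
    flagged = set()
    for col in schema:
        if NAME_HINTS.search(col):
            flagged.add(col)  # the column name itself looks like PII
    for row in sample_rows:
        for col, val in row.items():
            if isinstance(val, str) and EMAIL_VALUE.search(val):
                flagged.add(col)  # the *values* look like PII, whatever the name
    return sorted(flagged)
```

A column called uid holding email addresses is invisible to the name check but caught by the value check, which is the whole argument for sampling data rather than trusting labels.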

Two Tables Claiming to Be the Source of Truth

This is the one that made me genuinely uncomfortable, because I recognized it immediately from the opening meeting.

I set it up deliberately: two tables, revenue_summary and revenue_final, with overlapping columns, similar descriptions, and different numbers for the same time period. I asked each catalog which one to trust.

None of them could answer this directly. That's not entirely a catalog failure; it's a governance failure upstream. But the catalogs differed dramatically in how they handled the ambiguity.

The Librarian showed both tables with equal prominence. No indication of which was authoritative. If anything, its clean, symmetrical presentation made them look equally valid.

The Modern Stack Native surfaced dbt model metadata that included a meta field where someone had written "use this one" in the description. Useful, but only because a human had done the work, and only because the catalog surfaced free-text descriptions prominently.

The Enterprise Fortress had a "certified" status that could be applied to datasets by designated data stewards. revenue_final had it. revenue_summary didn't. Clearest answer of any catalog, but it only worked because someone had actually done the certification. Three of the other catalogs had similar features that were completely unused because there was no process forcing anyone to use them.

The Lightweight Newcomer had a thumbs up/down voting feature. revenue_final had 4 upvotes. revenue_summary had 1.

Democracy is not governance.
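If you wanted to encode "certification beats votes" as actual logic, it fits in a few lines. Everything here is hypothetical, field names included; the point is that the rule only returns an answer when exactly one steward has actually done the work:

```python
# Hypothetical resolution rule: prefer explicit steward certification over
# any softer signal (votes, recency, description text).

def pick_source_of_truth(candidates):
    """candidates: list of dicts like {"name": str, "certified": bool}."""
    certified = [c for c in candidates if c.get("certified")]
    if len(certified) == 1:
        return certified[0]["name"]  # exactly one certified asset: trust it
    return None  # zero or several certified: ambiguous, escalate to a human
```

Returning None for the ambiguous case is the design choice that matters: a catalog that refuses to guess is more honest than one that ranks by upvotes.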

Pipeline Failed, Find Downstream Impact

A pipeline failed at 3 am and produced null values for six hours before anyone noticed. I needed to know which dashboards, reports, and downstream models were built on this data and might now be showing incorrect numbers.

The Graph was built for this. Starting from the broken table, I could trace every downstream dependency in a visual flow. It took four minutes to produce a complete impact list. This is where lineage-first design earns its keep.

The Modern Stack Native could do this too, but only within its integrated ecosystem. Dashboards built in the BI tool weren't connected to the lineage graph unless someone had explicitly configured the integration. Half our dashboards weren't.

The Enterprise Fortress had lineage, but it was table-level and required manual traversal. For a complex dependency graph, this means clicking through fifteen screens to build a picture that The Graph showed in one.

The Librarian and the Lightweight Newcomer had no meaningful lineage for this scenario. The Newcomer's roadmap was very promising, though.
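Under the hood, an impact query like The Graph's is just a breadth-first walk over a dependency map. A minimal sketch with invented asset names, not any vendor's API:

```python
from collections import deque

# Lineage as a plain adjacency map: asset -> direct downstream assets.
# The names are illustrative.
LINEAGE = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["mart.revenue_daily", "mart.orders_wide"],
    "mart.revenue_daily": ["dash.exec_revenue"],
    "mart.orders_wide": [],
    "dash.exec_revenue": [],
}

def downstream_impact(graph, broken_asset):
    """Breadth-first walk from the broken asset; returns every affected node."""
    affected, queue = set(), deque([broken_asset])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return sorted(affected)
```

The hard part is never the traversal; it's keeping the adjacency map complete, which is exactly where the un-integrated dashboards fell out of the picture.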

Regulatory Audit, Customer Record Lineage

This is where things got sobering.

A regulator asks: "Show me every system that has touched data related to this customer, from collection to storage to processing."

None of the five catalogs could fully answer this. Not one.

The best I could do, using the Enterprise Fortress and The Graph together, was trace a customer record through the warehouse layer. I could not trace it through the operational systems that fed the warehouse. I could not trace what happened to it in the transformation steps at the row level.

This is not a catalog problem. Catalogs operate at the asset level (tables, columns, and pipelines), not at the record level. Row-level lineage is a different category of tooling entirely.

But here's the governance failure: two of these catalogs, in their marketing materials, claimed to support "end-to-end data lineage" and "regulatory compliance." What they meant was something much narrower. If you were relying on that claim to satisfy an actual regulatory requirement, you would have a very bad day when the auditor arrived.

Is This Dataset "Trusted"?

A team wants to build a new feature on top of a dataset. They ask the catalog: Should we use this?

This is ultimately a people-and-process question that a tool can only partially answer. But the signals of trustworthiness were the same across all five catalogs: was the data updated recently and on schedule; do data quality tests exist and pass; is a named human responsible for it; is anyone else using it; and has a data steward explicitly marked it as production-ready.

No single catalog surfaced all five clearly. The Modern Stack Native came closest because dbt's test metadata gave it real freshness and quality information rather than metadata someone had to manually enter. The Enterprise Fortress had the best certification workflow. The Graph showed usage signals clearly.

The catalog I'd actually want doesn't exist yet. It would combine all five signals into a trust score that a non-technical analyst could read in ten seconds.
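That composite is almost embarrassingly small to sketch. The five signals and weights below are my own guesses at what "trusted" could mean, not any vendor's scoring model:

```python
# Hypothetical trust score over the five governance signals discussed above.
# Weights are invented; tune them to your organization's actual risk profile.

def trust_score(asset):
    """asset: dict of boolean governance signals; returns a 0.0-1.0 score."""
    weights = {
        "fresh_on_schedule": 0.25,   # last update landed on time
        "tests_passing": 0.25,       # quality tests exist and pass
        "has_named_owner": 0.20,     # a human is accountable
        "actively_used": 0.15,       # other teams depend on it
        "steward_certified": 0.15,   # explicit production-ready mark
    }
    return round(sum(w for k, w in weights.items() if asset.get(k)), 2)
```

A dataset that is fresh, tested, and owned but uncertified and unused would score 0.70, which is roughly the judgment an experienced engineer makes by eyeball today.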

The Scorecard

| | Lineage Depth | PII Classification | Governance Workflow | Ease of Use | Integration Breadth |
|---|---|---|---|---|---|
| The Librarian | Medium | High | Medium | Low | Medium |
| The Graph | Very High | Low | Low | Medium | Medium |
| Modern Stack Native | High | Low | Medium | High | Low |
| Enterprise Fortress | Medium | Very High | Very High | Very Low | High |
| Lightweight Newcomer | Very Low | Low | Low | Very High | Low |

What I Learned That I Didn't Expect

Silent failures are the real danger. Every catalog failed at least one scenario. But the most dangerous failures weren't the ones that errored out. They were the ones that returned an answer confidently when the answer was wrong or incomplete. A catalog that tells you it found all PII tables when it actually found 80 percent of them is more dangerous than no catalog at all, because it creates compliance confidence that isn't warranted.

Governance is a process problem wearing a tool costume. Every feature I tested (certification, tagging, ownership) only worked when a human process was driving it. The catalog with the best certification workflow was useless without data stewards who actually performed certifications. The tool amplifies the process. It cannot replace it.

Not all lineage is equal. Table-level lineage and column-level lineage are different things. Warehouse lineage and end-to-end lineage are different things. "We have lineage" is not an answer. Ask specifically: what level, what systems, how is it populated, and when was it last verified.

The catalog your analysts will actually use beats the catalog with the best features. The Lightweight Newcomer failed half of my governance scenarios. It also had the highest adoption among the teams I showed it to, because its search worked like Google, and its onboarding took twenty minutes. A perfect catalog nobody uses is worth less than a mediocre one that's become part of how people actually work.

The VP's question still doesn't have a good answer. After all this, I came back to the original problem: where does this revenue number come from? I can answer it better now. I can show the lineage, the owner, the tests, the last update time. But I cannot produce a complete audit-ready provenance trail for a single metric from source to dashboard. That's the gap the industry hasn't solved yet, and any vendor claiming otherwise is overselling.

What I'd Tell My Past Self

Buy the catalog that your least technical stakeholder can use without training. Then build the process that makes the features matter. In that order.

And if a vendor ever shows you a demo where the governance workflows look effortless, ask them how long the initial metadata population took. That's where the real cost lives. That's also where most implementations quietly die.

The catalog is not the strategy. The catalog is the place where the strategy becomes visible.