Revenue should be the easiest metric in the building. It was not.

It started, as these things always do, with a Slack argument.

A product manager posted a screenshot of a dashboard showing monthly revenue at $4.2M. A finance analyst replied with a different number: $3.8M. Both were pulling from the same warehouse. Both were confident. Both had reasons.

The product manager's number included pending orders. The finance analyst excluded them because pending orders aren't recognized revenue under their accounting policy. Neither was wrong. They were answering different questions using the same word.

I've seen this fight in every company I've worked at. The word "revenue" gets used forty times a day and means something slightly different each time. Usually, we fix it with a meeting, a Confluence page, and a shared understanding that lasts about three months before someone new joins and the cycle starts over.

This time I wanted to try something different. Semantic layers are supposed to solve exactly this problem. Define the metric once, in one place, and every tool (and every AI agent) that queries it gets the same answer. That's the promise. I wanted to see if it held up.

So I took one metric, defined it in four semantic layers, and checked whether they agreed.

Why This Matters More Than It Used To

A year ago, metric inconsistency was annoying but manageable. A data team could maintain a handful of blessed dashboards and correct people when the numbers drifted.

That worked when humans were the only consumers of your metrics. It does not work when AI agents are.

We're building NL-to-SQL interfaces now. A stakeholder asks, "What was revenue last quarter?" and an AI generates SQL, executes it, and returns an answer. If the metric definition lives in someone's head, or in a Confluence page the AI can't read, the agent will do what LLMs always do when they lack context. It will guess. Confidently.

The semantic layer is supposed to be the governance checkpoint between the question and the query. The place where "revenue" gets translated into a specific, sanctioned SQL expression before anything runs. If that layer doesn't work, or if different layers define revenue differently, you don't have governed analytics. You have an automated inconsistency.
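
The checkpoint idea can be made concrete with a toy sketch. Everything here is hypothetical (the dictionary, the function, the SQL expression are all illustrative, not any real tool's API): the point is only that the agent resolves the word "revenue" against a sanctioned definition before any SQL runs, and refuses when no governed definition exists.

```python
# Hypothetical sketch of a semantic-layer checkpoint for an NL-to-SQL agent.
# The agent may only use metrics that resolve to a sanctioned expression;
# anything else is refused rather than guessed at.
SANCTIONED_METRICS = {
    "net revenue": (
        "SUM(order_amount) - SUM(refund_amount) "
        "/* refunds backdated to the original order date */"
    ),
}

def resolve_metric(term: str) -> str:
    """Return the governed SQL expression for a metric term, or refuse."""
    expr = SANCTIONED_METRICS.get(term.strip().lower())
    if expr is None:
        raise LookupError(f"no governed definition for {term!r} - refusing to guess")
    return expr

print(resolve_metric("Net Revenue"))
```

The refusal branch is the part that matters: an ungoverned term should be an error the agent surfaces, not a prompt for the LLM to improvise.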

The Metric

I chose net revenue because it's the metric most likely to cause disagreement. Gross revenue is usually simple (sum of order amounts). But net revenue requires decisions:

- Do you subtract refunds at the time of the refund, or backdate them to the original order?
- Do you include or exclude pending orders?
- Do you include orders that were placed but not yet invoiced?
- For international orders, do you convert currency at the rate at the time of order or the time of settlement?
- Do you count free trial conversions on the conversion date or the first payment date?

Each of these choices changes the number. They're all legitimate. The point of a semantic layer is to make the choice explicit, consistent, and enforceable.

I defined net revenue in four semantic layers: dbt Semantic Layer (MetricFlow), Cube, LookML, and Netflix's DataJunction. Same underlying warehouse tables. Same business rules (which I wrote down before starting, so I couldn't unconsciously drift between implementations). Same time period.

Then I queried each one for Q4 net revenue by region.

dbt Semantic Layer (MetricFlow)

MetricFlow went open-source under Apache 2.0 at Coalesce 2025. If you already use dbt for transformations, adding metric definitions feels natural. You define measures and dimensions in your dbt model YAML, and MetricFlow generates the SQL.

Defining net revenue was straightforward. I created a measure of type sum on the order_amount column, added a filter for order_status != 'refunded', and created a derived metric that subtracted a separate refund measure. The refund was backdated to the original order date, which matched my pre-defined business rules.
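
In MetricFlow's YAML, the shape of that definition looks roughly like the sketch below. The model, column, and metric names are illustrative, not the actual project; the structure (measures on a semantic model, a filtered simple metric, a derived metric composing two others) is the real MetricFlow pattern.

```yaml
semantic_models:
  - name: orders
    model: ref('fct_orders')   # illustrative dbt model name
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: order_status
        type: categorical
    measures:
      - name: order_total
        agg: sum
        expr: order_amount
      - name: refund_total
        agg: sum
        expr: refund_amount    # already backdated to order date in the dbt model

metrics:
  - name: gross_revenue
    type: simple
    type_params:
      measure: order_total
    filter: |
      {{ Dimension('order_id__order_status') }} != 'refunded'
  - name: refunds
    type: simple
    type_params:
      measure: refund_total
  - name: net_revenue
    type: derived
    type_params:
      expr: gross_revenue - refunds
      metrics:
        - name: gross_revenue
        - name: refunds
```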

The query returned $3.81M for Q4.

What worked: the tight coupling with dbt models meant the metric definition sat right next to the transformation logic. One repo, one review process, one deployment. When I changed the refund logic, the PR showed exactly what changed and why. Version control for metric definitions is not a small thing.

What didn't: MetricFlow is a metric definition layer, not a full query engine. It doesn't have its own caching, access control, or API delivery. If I wanted to serve this metric to a BI tool and an AI agent simultaneously with row-level security, I needed additional infrastructure. The definition was clean. The delivery was incomplete.

Also, the documentation assumed I already thought in MetricFlow's specific vocabulary of measures, dimensions, and entities. If I hadn't been living in dbt for years, the learning curve would have been steeper than the actual metric logic warranted.

Cube

Cube is API-first. You define metrics in a data model, and Cube exposes them through a REST or GraphQL API with built-in caching and access control. It's designed for consumption, not just definition.

Defining net revenue required a different mental model. Cube uses measures and dimensions, too, but the schema file lives outside dbt, in its own repo with its own syntax. I defined order_amount as a sum measure, added a filter for non-refunded orders, and created a calculated measure for the refund subtraction.
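
For comparison, the equivalent Cube model looks roughly like this. Again, table and column names are illustrative; the measure-level `filters` and the calculated `number` measure composing two sums are standard Cube constructs.

```javascript
cube(`orders`, {
  sql_table: `analytics.fct_orders`, // illustrative table name

  measures: {
    gross_revenue: {
      sql: `order_amount`,
      type: `sum`,
      filters: [{ sql: `${CUBE}.order_status != 'refunded'` }],
    },
    refunds: {
      sql: `refund_amount`,
      type: `sum`,
    },
    net_revenue: {
      // calculated measure: composes the two sums above
      sql: `${gross_revenue} - ${refunds}`,
      type: `number`,
    },
  },
});
```

Note what this file cannot express: which materialization of the data `analytics.fct_orders` points at. The formula travels; the data path doesn't.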

The query returned $3.84M for Q4.

That's a $30K discrepancy from MetricFlow. On the same data. With what I thought were the same business rules.

I spent forty minutes finding the difference. It came down to currency conversion. MetricFlow applied the conversion rate at the time of order (because that's when the dbt model materialized the converted amount). Cube was querying a view that joined to a daily exchange rate table, and for orders placed on weekends, the rate fell back to the previous Friday. For a small number of international orders, the rate differed by fractions of a cent. Multiplied across thousands of transactions, that became $30K.
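
The compounding is easy to underestimate. A back-of-the-envelope sketch (all figures below are hypothetical, chosen only to show the mechanics, not the actual order data) of how a rate difference of fractions of a cent per unit of currency becomes tens of thousands of dollars:

```python
# Illustrative only: how a tiny per-unit FX rate difference compounds
# across a quarter of international orders.
orders = 40_000            # hypothetical international order count
avg_amount_eur = 180.0     # hypothetical average order value in EUR
rate_at_order = 1.0932     # rate materialized by the dbt model
rate_fallback = 1.0890     # previous-Friday fallback used by the rate view

delta_per_order = avg_amount_eur * (rate_at_order - rate_fallback)
total_delta = orders * delta_per_order
print(f"per-order difference: ${delta_per_order:.2f}")   # well under a dollar
print(f"total divergence:     ${total_delta:,.0f}")      # roughly $30K
```

A difference that no per-order spot check would ever flag still moves the quarterly number by an amount a board member would notice.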

Neither was wrong. Both were defensible. But the whole point of a semantic layer is that this kind of divergence doesn't happen. And it happened because the metric definition was clean in both tools, while the underlying data path was subtly different. The semantic layer governed the formula. It did not govern the data it was applied to.

What worked: Cube's caching and API layer are excellent. I had a working REST endpoint for net revenue in under an hour. Access control was declarative and clear. If I were serving metrics to multiple tools (BI, AI, embedded analytics), Cube's delivery model is the most production-ready of the four.

What didn't: maintaining a separate schema repo from my dbt models meant two sources of truth for the same metric. When I updated the refund logic in dbt, I had to remember to update it in Cube too. I forgot once during testing. That's exactly the failure mode that semantic layers are supposed to prevent.

LookML

LookML is the veteran. Looker pioneered the semantic-layer-as-code approach, and it shows. The modeling language is mature, expressive, and opinionated about how you structure dimensions and measures.

Defining net revenue in LookML was the most verbose of the four, but also the most explicit. Every join relationship, every filter, every aggregation was spelled out in a way that left very little room for ambiguity. The refund logic was a derived table with its own SQL block, which made the business rules completely visible.
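
A condensed sketch of the LookML shape (view and field names illustrative; the `filters` list on a measure and the `number`-type composed measure are standard LookML):

```lookml
view: orders {
  sql_table_name: analytics.fct_orders ;;  # illustrative table name

  dimension: order_status {
    type: string
    sql: ${TABLE}.order_status ;;
  }

  measure: gross_revenue {
    type: sum
    sql: ${TABLE}.order_amount ;;
    filters: [order_status: "-refunded"]   # exclude refunded orders
  }

  measure: refunds {
    type: sum
    sql: ${TABLE}.refund_amount ;;
  }

  measure: net_revenue {
    type: number
    sql: ${gross_revenue} - ${refunds} ;;
  }
}
```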

The query returned $3.81M for Q4. Same as MetricFlow.

This made sense. Both were operating on the same dbt-materialized tables with the same pre-converted currency values. The $30K Cube discrepancy was a data path issue, not a formula issue.

What worked: LookML's governance features are the most mature. Explore-level access controls, model-level permissions, and field-level descriptions that actually get surfaced in the UI. If you're already in the Looker ecosystem, the governance story is strong.

What didn't: it's proprietary to Google Cloud. My LookML definitions don't travel outside Looker. If I want to serve the same metric to a Python notebook, an AI agent, or a non-Looker BI tool, I either duplicate the definition or build a translation layer. In 2026, with AI agents increasingly consuming metrics programmatically, vendor lock-in on your metric definitions is a governance risk, not just an infrastructure inconvenience.

Google launched Looker Modeler to decouple the semantic layer from Looker's BI interface, but adoption is early and the ecosystem is still catching up.

Netflix DataJunction

DataJunction is the newest of the four. Netflix open-sourced it to solve exactly the problem I was testing: metric inconsistency across distributed teams. It uses a graph-based metadata model where metrics are nodes and their dependencies (dimensions, source tables, transformations) are edges.

The defining difference is that DataJunction decouples metric definitions from compute. You define what net revenue means in the graph, and DataJunction generates SQL for whatever engine you're running (Spark, Trino, the warehouse). The metric definition is portable by design.

Defining net revenue was surprisingly clean. I registered the base measures, defined the derived metric as a composition, and tagged the refund backdating logic as an explicit transformation node. The graph visualization showed me every dependency, which made the audit trail obvious.

The query returned $3.81M for Q4. Same as MetricFlow and LookML.

What worked: the graph-based model made lineage trivial. I could trace net revenue back through every transformation, join, and source table in one visual. When Netflix says they use this for "auditable metric lineage," I believe them. I could see every decision point.

For AI integration specifically, DataJunction's approach has an advantage that the others don't. Because the metric graph is queryable as metadata (not just as SQL output), an AI agent can ask "what does net revenue include?" and get a structured answer, not just a number. That's the difference between an agent that returns a result and an agent that can explain its result.
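
To make that distinction concrete, here is a toy metric graph in plain Python dicts. This mimics the *idea* of DataJunction's graph (metrics as nodes, dependencies as edges, queryable as metadata), not its actual API; every name in it is illustrative.

```python
# Illustrative only: a toy metric graph, showing how an agent could answer
# "what does net revenue include?" from metadata rather than from SQL output.
GRAPH = {
    "net_revenue": {
        "type": "metric",
        "expr": "gross_revenue - refunds",
        "depends_on": ["gross_revenue", "refunds"],
    },
    "gross_revenue": {
        "type": "metric",
        "expr": "SUM(order_amount) WHERE order_status != 'refunded'",
        "depends_on": ["fct_orders"],
    },
    "refunds": {
        "type": "metric",
        "expr": "SUM(refund_amount) /* backdated to order date */",
        "depends_on": ["fct_orders"],
    },
    "fct_orders": {"type": "source", "depends_on": []},
}

def explain(node: str, depth: int = 0) -> list[str]:
    """Walk the graph, returning every definition the metric relies on."""
    info = GRAPH[node]
    lines = ["  " * depth + f"{node}: {info.get('expr', info['type'])}"]
    for dep in info["depends_on"]:
        lines.extend(explain(dep, depth + 1))
    return lines

print("\n".join(explain("net_revenue")))
```

An agent walking this structure can state, in order, every choice baked into the number it just returned. A SQL string can't.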

What didn't: DataJunction is young. In places, the documentation assumes you operate at Netflix scale with Netflix-internal context. The community is small. Setting it up took longer than any of the others because the ecosystem tooling isn't mature yet. If you're not running Spark or Trino, the engine support is limited.

Also, being compute-decoupled means DataJunction generates SQL but doesn't cache, serve, or secure the results. Like MetricFlow, it's a definition layer, not a delivery layer. You still need infrastructure around it.

The Scorecard

| Tool | Net Revenue (Q4) | Definition Portability | Access Control | AI Agent Readiness | Setup Complexity |
|---|---|---|---|---|---|
| dbt (MetricFlow) | $3.81M | High (open-source) | None (needs infra) | Low | Low |
| Cube | $3.84M | Medium (own schema) | Built-in | Medium | Medium |
| LookML | $3.81M | Low (vendor-locked) | Built-in (mature) | Low | Low (if in Looker) |
| DataJunction | $3.81M | Very High | None (needs infra) | High | High |

What I Learned That I Didn't Expect

The formula is the easy part. All four tools handled the metric definition itself just fine. The hard part was everything around it: currency conversion paths, refund timing logic, and which materialized table the query actually hit. A semantic layer governs the formula. It does not automatically govern the data the formula runs on. If two tools define revenue identically but query different materializations of the underlying data, you'll still get different numbers. That's the gap.

Only 5% of teams are using semantic models. Joe Reis's 2026 survey of 1,100 data practitioners found that number. After this experiment, I understand why. The tooling works. The organizational commitment doesn't. Defining a metric in a semantic layer takes an hour. Getting finance, product, and sales to agree on what the metric means takes a quarter. The technology is ahead of the process.

AI readiness is the new differentiator. A year ago, I would have ranked these tools on BI integration and developer experience. Now the question that matters most is: can an AI agent query this metric and explain how it was calculated? DataJunction's graph model and MetricFlow's open-source definitions are better positioned for this than LookML's proprietary ecosystem or Cube's separate schema. When your primary metric consumer shifts from a dashboard to an agent, the governance requirements change fundamentally.

The $30K discrepancy is the whole story. Three tools agreed. One disagreed by a small, defensible amount. In a board meeting, that discrepancy would derail the conversation. In an audit, it would raise questions. In an AI-generated report, it would go unnoticed until someone ran the numbers in a different tool and started another Slack argument. The semantic layer didn't cause the discrepancy. But it also didn't prevent it. That's the gap between definition governance and data governance, and most organizations have only addressed the first one.

What I'd Tell My Past Self

Pick the semantic layer your team will actually maintain, not the one with the best feature list. An unmaintained semantic layer is worse than no semantic layer, because it creates false confidence that the numbers are governed when they're not.

And if you're building NL-to-SQL or any AI-powered analytics, get the semantic layer in place first. Not after. The AI will query your data, whether you've governed it or not. The only question is whether it queries governed definitions or invents its own.

The metric is the contract. The semantic layer is where the contract gets enforced. And right now, most of us are enforcing it in Confluence pages that nobody reads, and AI agents can't access.