There's a particular kind of organizational delusion I've watched play out in war rooms at three in the morning, and it goes like this: the dashboards are green, Datadog isn't paging anyone, the on-call engineer has refreshed Grafana six times in the last two minutes, and the VP of Product is on Slack asking why eleven thousand users just rage-quit the checkout flow. Everyone stares at the screens. The screens stare back, impassive, satisfied with themselves. CPU sits at a comfortable 40%. Memory pressure is nominal. Network throughput looks textbook.
The system is lying to you. Politely, comprehensively, in full color.
This isn't a tooling failure in the narrow sense — Datadog and Prometheus and CloudWatch are doing exactly what you asked them to do. That's the problem. You asked the wrong questions and built an observatory pointed at the wrong sky.
The Infrastructure Vanity Mirror
Most observability setups are, if you peel back the architectural rationale, anxiety management for engineers. Not user experience monitoring. Not reliability engineering. Anxiety management. Someone got paged at 2 AM because a disk filled up in 2019, so now there are disk utilization alerts. Someone's manager read a blog post about GC pressure, so now there are JVM heap graphs on the main dashboard. Layer by layer, over months or years, these dashboards accrete into something that feels comprehensive but is actually a portrait of the engineering team's fears rather than a model of user experience.
The technical term for what you've built is a resource metrics monoculture. You measure what the infrastructure reports because infrastructure is instrumented by default, because cloud providers surface it for free, because it satisfies the intuition that if the machine is healthy the service must be healthy. But the machine is a means. The service is the point.
Here's a concrete fracture I've seen more than once: an e-commerce platform where the payment provider's SDK was silently catching timeout exceptions and returning a structured error object that the application layer was quietly swallowing — logging a warning, not an error, because someone had made an opinionated call about log verbosity in a PR review eighteen months earlier. The infrastructure was pristine. The pod was healthy. The service mesh showed normal latency distributions. Meanwhile, roughly 23% of checkout attempts were failing with a generic "something went wrong" message, invisible to every monitor except the one no one had built yet: transaction completion rate.
That gap — between the system's self-reported health and the reality users inhabit — is what bad observability manufactures. It's not noise. It's signal inversion.
Why Average Latency Is Almost Always Useless
Let me be more specific about a mechanism that bites people repeatedly because it's genuinely counterintuitive until you've been burned.
Average latency hides its sins in the distribution's tail. Suppose your API has a p50 response time of 180ms — fast, comfortable, publishable in a postmortem as evidence of system health. The p99, if you're not looking at it, might be 14 seconds. That means roughly one in a hundred requests is timing out from the user's perspective, which at ten thousand requests per minute means a hundred users every minute hitting a wall. Your headline number looks fine because the distribution is right-skewed: the median sits in the fast bulk, the mean gets dragged up only slightly, and neither says anything about the extremes.
What causes that tail? Could be database connection pool exhaustion under specific concurrency patterns. Could be a cold-start penalty on a lambda that handles a minority of traffic shapes. Could be a downstream service with unpredictable GC pauses. The point isn't the cause — the point is that your p50 graph will never surface it. You need percentile slicing, and you need it disaggregated by endpoint, by customer cohort, by traffic source, because aggregation is where truth goes to die.
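The mechanism is easy to demonstrate. Here's a minimal sketch with a simulated right-skewed latency distribution — the specific numbers (180ms bulk, a pathological tail toward a 14-second timeout; 2% pathological here, to make the tail unmissable) are illustrative, not from any real system:

```python
import random
import statistics

random.seed(42)

# Simulate 10,000 request latencies (ms): a fast bulk around 180ms,
# plus a 2% pathological tail (e.g. pool exhaustion drifting toward timeout).
latencies = [
    random.gauss(180, 40) if random.random() < 0.98 else random.uniform(8_000, 14_000)
    for _ in range(10_000)
]

def percentile(values, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]

print(f"mean: {statistics.mean(latencies):7.0f} ms")  # barely moved by the tail
print(f"p50:  {percentile(latencies, 50):7.0f} ms")   # comfortably fast
print(f"p99:  {percentile(latencies, 99):7.0f} ms")   # the wall users hit
```

The mean and median both stay in "looks fine" territory while the p99 is two orders of magnitude worse — which is exactly why the percentile, not the average, belongs on the dashboard.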
The engineers who understand this instinctively start with questions about the shape of failure before they write any instrumentation. What does a degraded checkout look like from the user's perspective? Slow? Silent? Error message? Does it degrade uniformly or in clusters? Does it correlate with geography, time of day, browser, account age? These questions determine what you instrument. You're not collecting telemetry speculatively and hoping patterns emerge — you're designing a measurement apparatus around a falsifiable model of how the system breaks.
SLOs Are Not Bureaucracy, They're a Forcing Function
Service Level Objectives get positioned, badly, as a compliance exercise — something SRE teams present to management in quarterly reviews using PowerPoint slides with traffic-light formatting. That framing kills their utility completely.
An SLO is more usefully understood as an epistemological commitment. You're declaring, in advance, what constitutes meaningful degradation versus acceptable variation. You're drawing the line between "interesting infrastructure event" and "user-impacting incident." And crucially — this is the part most teams skip — you're doing it in collaboration with product, because the line isn't a technical judgment. It's a business judgment about what level of reliability your users expect and what it costs to exceed it.
When you define checkout success rate as an SLI with a 99.5% SLO over a rolling 28-day window, you've done several things simultaneously. You've created an alert that fires based on user experience rather than resource consumption. You've created an error budget — roughly 0.5% of transactions can fail before you breach — which gives you a concrete quantity to spend or conserve when making deployment decisions. And you've created a shared language between engineering and product for conversations about risk that doesn't require either side to translate from technical metrics.
The error budget is particularly underappreciated as a decision tool. If you've burned through 80% of your monthly error budget in the first ten days, you shouldn't be deploying new features, because each deployment carries some probability of further degradation. If your error budget is fully intact at day 28, that's an argument — a quantitative one — for accepting more deployment risk, experimenting more aggressively. The budget makes the risk legible. That's its function.
What You're Actually Buying When You Buy "Full Observability"
The enterprise observability market will sell you a platform that ingests everything — every log line, every trace, every metric, every event — and promises that with sufficient data you'll achieve total system comprehension. This is a seductive and largely false proposition.
The failure mode is subtle. When you instrument everything without a prior model of what matters, you end up with an enormously expensive noise machine. Storage costs compound. Query times grow. Engineers, faced with an incident, open the logging platform and encounter millions of rows with no obvious signal gradient — everything is recorded, nothing is prioritized, and the act of finding the relevant trace in a sea of irrelevant traces can take longer than the incident itself. I've watched postmortems where the majority of time was spent not diagnosing the failure but just locating the evidence. That's not observability. That's archaeology.
The engineering discipline here is uncomfortable because it requires restraint in an environment that rewards comprehensiveness. You have to make active decisions to not capture certain logs. To drop high-cardinality traces below a sampling threshold. To resist adding that new dashboard panel because it monitors something interesting but not something actionable. Observability debt is real — accumulated instrumentation that nobody reads, maintains, or acts on, but which costs money and cognitive overhead every single month.
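One way to make that restraint mechanical is a keep-or-drop decision at trace completion: always retain the traces that can answer a question (errors, tail latency), and sample the routine majority. A minimal sketch — the 2-second threshold and 1% baseline rate are illustrative placeholders, not recommendations:

```python
import random

def should_keep_trace(trace: dict, baseline_rate: float = 0.01) -> bool:
    """Keep every interesting trace; sample the healthy majority."""
    if trace.get("error"):                      # failures are always worth storing
        return True
    if trace.get("duration_ms", 0) > 2_000:     # tail latency is the signal
        return True
    return random.random() < baseline_rate      # ~1% of routine traffic

random.seed(7)
traces = (
    [{"error": False, "duration_ms": 150} for _ in range(10_000)]   # healthy
    + [{"error": True, "duration_ms": 150} for _ in range(50)]      # failed
    + [{"error": False, "duration_ms": 9_000} for _ in range(20)]   # slow
)
kept = [t for t in traces if should_keep_trace(t)]
print(f"stored {len(kept)} of {len(traces)} traces")
```

You store a couple of percent of the volume while keeping 100% of the traces you'd actually pull up during an incident — the opposite of the ingest-everything noise machine.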
A useful heuristic I've landed on: for every metric or log source you add, ask "what decision would I make differently if this number changed?" If the answer isn't crisp and immediate, you probably don't need it yet. Maybe you will need it. Keep it in a runbook as something to enable during specific incident investigation patterns. Don't put it on the main dashboard where it'll train engineers to ignore signal by surrounding it with noise.
The Monday Morning Diff
Concrete changes: what I would actually do.
First: audit your existing alerts for infrastructure-to-outcome ratio. If more than 60% of your firing alerts in the last 90 days were infrastructure metrics that didn't correlate with user complaints, you have measurement misalignment. Start eliminating or downgrading those alerts. Not all of them — disk space still matters, runaway processes still matter — but anything that fires without triggering a user-facing degradation should be at most a Slack notification, not a PagerDuty wake-up.
Second: instrument your critical user paths explicitly, from the outside. Not via application logs from inside the service, but via synthetic monitoring that executes the actual flow — login, add to cart, checkout, confirmation — on a schedule, from multiple regions, against production. This is boring infrastructure. It is also frequently the only monitor that catches the class of failure where every internal metric is healthy but the user flow is broken because of an integration contract violation, a config drift, a third-party API behavioral change.
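The skeleton of such a check is small. Everything here is hypothetical — the endpoint paths, the time budget, and the `http_get` parameter, which stands in for whatever HTTP client and auth your stack actually uses:

```python
import time
from typing import Callable

# A synthetic check is the user's critical path, executed as code.
CHECKOUT_FLOW = ["/login", "/cart/add", "/checkout", "/confirmation"]

def run_synthetic_check(http_get: Callable[[str], int],
                        budget_ms: float = 3_000) -> dict:
    """Walk the flow end to end; fail on any non-200 or a blown time budget."""
    start = time.monotonic()
    for step in CHECKOUT_FLOW:
        status = http_get(step)
        if status != 200:
            return {"ok": False, "failed_step": step, "status": status}
    elapsed_ms = (time.monotonic() - start) * 1000
    return {"ok": elapsed_ms <= budget_ms, "elapsed_ms": elapsed_ms}

# Exercised here with a stub client; in production this runs on a schedule,
# from several regions, against the real service.
result = run_synthetic_check(lambda path: 200)
```

Note what this catches that internal metrics cannot: if `/checkout` starts returning a well-formed error page because a third-party contract changed, every pod stays green while this check goes red at the exact step users are stuck on.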
Third: define SLIs for the two or three things that, if broken, would cause your customers to leave. Not ten things. Two or three. For most products this is something like: can users accomplish their primary action successfully, and does it respond within a time threshold they'd consider acceptable? Instrument those precisely. Set SLOs on them. Build your alerting around SLO burn rate — burn rate alerting, where you alert when you're consuming error budget faster than sustainable, is more useful than threshold alerting because it gives you early warning on gradual degradation rather than only catching acute failures.
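Burn rate itself is just the observed error rate divided by the budgeted one. A sketch of the common multi-window variant — the 14.4 threshold is the conventional figure from the Google SRE workbook for fast burn over a 30-day window, used here as an illustrative default rather than a prescription:

```python
SLO_TARGET = 0.995
BUDGET_ERROR_RATE = 1 - SLO_TARGET   # 0.5% of requests may fail

def burn_rate(failed: int, total: int) -> float:
    """How many times faster than 'sustainable' the budget is being spent.
    1.0 means exactly on pace to exhaust the budget at the window's end."""
    if total == 0:
        return 0.0
    return (failed / total) / BUDGET_ERROR_RATE

def should_page(failed_1h: int, total_1h: int,
                failed_6h: int, total_6h: int) -> bool:
    """Page only when both a short and a long window burn fast,
    which filters out momentary blips without missing sustained burn."""
    return (burn_rate(failed_1h, total_1h) > 14.4
            and burn_rate(failed_6h, total_6h) > 14.4)

# 1% errors against a 0.5% budget is a burn rate of 2.0: degraded, not yet
# page-worthy — exactly the gradual slide threshold alerts tend to miss.
print(burn_rate(failed=100, total=10_000))
```

This is why burn rate beats static thresholds: a slow leak shows up as a sustained rate above 1.0 long before any single-window error count looks alarming.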
Fourth — and this is the one that meets the most organizational resistance — sit with a product manager or customer success lead and ask them: in the last three months, what user complaints did we get that engineering didn't know about until someone told us? The answer to that question maps exactly onto the blind spots in your observability. Build from there.
The dashboards will stay green sometimes while users suffer. That gap will never close entirely — distributed systems are too complex, too dependent on the behavior of systems you don't control, too subject to emergent failure modes nobody predicted. But the gap shrinks dramatically when you design your observability around what users experience rather than what infrastructure reports. The machines will always report that they're fine. They don't know any better. That's your job.