I've watched three competent engineering teams disintegrate under the weight of their own dashboards. Not from malice. From certainty—the institutional conviction that what gets measured gets managed, that opacity breeds waste, that surely someone should know whether the engineers are actually working.
The tragedy isn't that metrics lie. It's that they tell a truth so partial it becomes worse than ignorance.
The Credibility Chasm
Start with the rupture itself: JetBrains surveyed the field and found 66% of developers do not trust the productivity metrics applied to their work. Not "dislike." Not "find imperfect." Do not trust. That's the language of broken contracts. When two-thirds of your practitioners reject the instrumentation, you no longer have a measurement problem—you have a legitimacy crisis. The people closest to the work have looked at the dials and said: these numbers do not represent what I do.
And yet the dashboards multiply.
Consider the AI acceleration paradox. GitHub publishes numbers claiming Copilot makes coding 55% faster. Duolingo reports code review velocity up 67%. The tools are real; I use them. They do make certain operations faster—boilerplate especially, the mechanical translation of intent into syntax. But then you ask the teams: are you shipping faster? The answer comes back flat or evasive. Delivery cadence hasn't budged. Sometimes it's gotten worse.
GitClear traced part of the mechanism: after AI adoption, lines of code per developer jumped 76%. That sounds like productivity until you examine composition. Most of it was scaffolding—generated stubs, verbose frameworks, the kind of code that takes seconds to write and minutes to understand. Token count went up. Semantic density went down. The codebase expanded while the feature set stagnated. We'd optimized for a metric (output volume) that had decoupled entirely from the thing we actually wanted (working software that solves problems).
The bottleneck didn't disappear. It moved.
Where the Constraint Really Lives
If you've only ever measured code entry speed, AI feels like a revolution. Suddenly the limiting factor—how fast can human fingers translate thought into text—evaporates. But that was never the real constraint in mature systems. The constraint is the build pipeline that takes eleven minutes and fails half the time due to flake. The constraint is the PR queue where reviews languish for three days because the only person who understands the auth layer is on-call. The constraint is the cognitive overhead of loading twelve different contexts into working memory because your ticket system doesn't distinguish between "fix typo" and "redesign the transaction boundary."
AI addresses none of this. It makes the easy parts faster and leaves the hard parts untouched, which means it just moves the pileup downstream. Now you're churning out more code that still has to pass through the same narrowed arteries—fragile tests, manual QA gates, deployment processes that require three approvals and a sacrifice to the YAML gods. The metric says you're more productive. The calendar says you're shipping the same features in the same time, just with more lines to maintain.
This is why velocity as traditionally measured is such a dangerous proxy. It captures motion but not progress.
The Gaming Spiral
Metrics don't just measure behavior—they warp it. Goodhart's Law isn't theoretical; it's Tuesday afternoon.
Take story point velocity. The idea seemed reasonable: estimate relative complexity, track how many points a team completes per sprint, use that as a planning input. Neutral. Descriptive. Except the moment you attach incentives—performance reviews, promotion criteria, management scrutiny—the number stops being a measurement and becomes a target. And humans optimize for the target.
One organization I know of started tracking individual velocity. Within six months, 60% of engineers reported lower job satisfaction and 40% left entirely. What happened in between? Estimate inflation. People started padding their numbers. Five-point tasks became eights. Eights became thirteens. Not because the work got harder—because the measurement got weaponized. Collaboration collapsed; why would you help someone else finish their ticket when it doesn't move your number? Technical debt accumulated because refactoring work is hard to justify when it doesn't generate visible points. The team started taking shortcuts, shipping brittle code to hit the sprint target, then spending the next sprint fixing what they'd rushed.
Ron Jeffries, who co-created Extreme Programming and helped invent the concept of story points, eventually warned against the very tool he'd built: obsessing over estimates detracts from Agile's actual purpose, which is delivering value. The system ate itself. The metric designed to facilitate planning became an extractive treadmill, and the team's actual output—the stuff users care about—deteriorated even as the dashboard showed green.
GitClear documented a related pathology after AI adoption: code duplication quadrupled. Refactoring rates plummeted. Why? Because copy-paste and generate-similar are fast, and fast looks good when you're measuring commits or line delta. Rewriting tangled logic for clarity takes time and produces fewer lines. If your metric is output volume, clean-up becomes irrational. You get rewarded for making the problem worse.
Volume instead of value. Motion instead of progress.
The Human Tax
There's a psychological dimension here that the dashboards miss entirely. The cycle looks like this: metrics show the team "underperforming" → manager applies pressure → engineers work nights → quality craters → metrics get worse → pressure intensifies. It's a tightening gyre. The Stack Overflow 2025 survey found that senior developers now report lower job satisfaction than juniors. Think about that inversion. The people with the most context, the most skill, the most capacity to solve complex problems—they're the most demoralized. Partly because they remember when the work was different. Partly because they see through the metrics more clearly.
Secret surveillance makes it worse. Some organizations deploy "productivity monitoring" tools without transparency—trackers that log keystrokes, idle time, application focus. Developers know it's happening. They know the data is shallow and misleading (writing code is often a small fraction of software work; much of it is reading, thinking, debugging, discussing). They know management is making decisions based on this data anyway. The result isn't compliance—it's cynicism. Trust erodes. The social contract fractures. Good people leave.
And here's the insight that should be obvious but somehow isn't: AI didn't break these metrics. It revealed that they were always broken. When code entry speed was genuinely a bottleneck, velocity and commit counts seemed like reasonable proxies because they correlated loosely with progress. Now that AI has severed that correlation, the emptiness of the metric becomes undeniable. What we're measuring is not what matters. It never was. We just couldn't see it clearly until the terrain shifted.
What Actually Matters (And How You'd Know)
So if traditional metrics are poison, what's the alternative? Not measurement abandonment—you need feedback loops. But different telemetry, aimed at different questions.
The Developer Experience (DevEx) framework suggests a tripartite approach: combine system-level data (DORA metrics—deployment frequency, lead time, change failure rate, time to restore) with human-centric signals (cognitive load surveys, friction reports, context-switching frequency) and business outcomes (did the feature move the metric it was supposed to move?). None of these alone is sufficient. Together they triangulate something closer to reality.
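To make the system-level half of that concrete, here is a minimal sketch of the four DORA keys computed from a window of deployment records. The `Deploy` fields and the 30-day window are assumptions for illustration, not any tool's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

# Hypothetical deployment record; field names are illustrative, not a real schema.
@dataclass
class Deploy:
    commit_at: datetime                    # when the change was committed
    deployed_at: datetime                  # when it reached production
    failed: bool                           # did it cause a change failure?
    restored_at: datetime | None = None    # when service was restored, if it failed

def dora_snapshot(deploys: list[Deploy], window_days: int = 30) -> dict:
    """Compute the four DORA keys over a window of deployment records."""
    if not deploys:
        return {}
    lead_hours = [(d.deployed_at - d.commit_at).total_seconds() / 3600 for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures if d.restored_at
    ]
    return {
        "deploys_per_day": round(len(deploys) / window_days, 2),
        "median_lead_time_hours": round(median(lead_hours), 1),
        "change_failure_rate": round(len(failures) / len(deploys), 2),
        "median_time_to_restore_hours": round(median(restore_hours), 1) if restore_hours else None,
    }
```

Note what these numbers describe: the delivery system, not individuals. That distinction is the whole point.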
Ask different questions. Not "How many tickets closed?" but "How long does it take for a developer to get actionable feedback on a change?" That's lead time for code review, CI/build queue duration, the delta between commit and deploy. Those numbers reflect system health—where the friction actually lives. Not "What's the team velocity?" but "How often are developers being interrupted to context-switch between unrelated work?" High interrupt rates destroy flow, but they're invisible to output metrics.
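As one concrete reading of "time to actionable feedback," the sketch below treats the first CI result or the first human review comment, whichever arrives first, as the moment the author gets a signal. The `PullRequest` fields are hypothetical; substitute whatever your platform actually exposes.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

# Illustrative pull-request record; these field names are assumptions, not a platform API.
@dataclass
class PullRequest:
    opened_at: datetime
    first_ci_result_at: datetime | None    # first pass/fail signal from CI
    first_review_at: datetime | None       # first human review comment

def hours_to_first_feedback(pr: PullRequest) -> float | None:
    """Hours until the author got any actionable signal: a CI result or a human review."""
    signals = [t for t in (pr.first_ci_result_at, pr.first_review_at) if t is not None]
    if not signals:
        return None  # still waiting, which is arguably the most important data point of all
    return (min(signals) - pr.opened_at).total_seconds() / 3600

def feedback_report(prs: list[PullRequest]) -> dict:
    waits = [h for pr in prs if (h := hours_to_first_feedback(pr)) is not None]
    return {
        "median_hours_to_first_feedback": round(median(waits), 1) if waits else None,
        "open_prs_with_no_feedback": len(prs) - len(waits),
    }
```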
Track code review duration, but don't just measure time-to-merge—measure why reviews are slow. Is it because the PR is too large? Because the reviewer is overloaded? Because the code is in a part of the system nobody understands? Each diagnosis points to a different intervention. Build queues that take fifteen minutes signal infrastructure problems. Tests that flake 20% of the time signal technical debt. Deployments that require manual steps signal automation gaps.
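One way to move from "reviews are slow" to a diagnosis is to slice review time by a suspected cause. This sketch groups time-to-merge by PR size; the same grouping works per reviewer or per subsystem. The dict keys (`lines_changed`, `review_hours`) are placeholders, not a real schema.

```python
from collections import defaultdict
from statistics import median

def review_hours_by_pr_size(prs: list[dict]) -> dict:
    """Median time-to-merge grouped by PR size, to test whether large PRs explain slow reviews.

    Each pr is assumed to be a dict with 'lines_changed' and 'review_hours';
    both keys are illustrative placeholders.
    """
    buckets = defaultdict(list)
    for pr in prs:
        if pr["lines_changed"] <= 100:
            size = "small (<=100 lines)"
        elif pr["lines_changed"] <= 500:
            size = "medium (101-500 lines)"
        else:
            size = "large (>500 lines)"
        buckets[size].append(pr["review_hours"])
    return {size: round(median(hours), 1) for size, hours in buckets.items()}
```

If the large bucket dominates, the intervention is smaller PRs; if small PRs also languish, look at reviewer load or knowledge silos instead.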
And—this matters—measure developer sentiment directly. Run short, frequent surveys. Ask: "Did you accomplish what you set out to do today?" "What blocked you?" "On a scale of 1 to 5, how frustrated are you with the tools?" Qualitative signals matter. When someone writes "the build is a nightmare and nobody will fix it," that's not anecdata—it's a leading indicator of attrition.
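A pulse like that is trivial to tabulate. The sketch below assumes one anonymous response per developer per week, with the questions above mapped to three hypothetical fields; the aggregation is just an average, a rate, and a frequency count of named blockers.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean

# One anonymous weekly response; the fields mirror the questions above and are illustrative.
@dataclass
class PulseResponse:
    accomplished_goal: bool    # "Did you accomplish what you set out to do?"
    frustration: int           # 1 (fine) .. 5 (ready to quit)
    blocker: str               # free text: "What blocked you?"

def weekly_pulse_summary(responses: list[PulseResponse]) -> dict:
    """Average frustration, goal-completion rate, and the most frequently named blockers."""
    if not responses:
        return {}
    return {
        "avg_frustration": round(mean(r.frustration for r in responses), 2),
        "accomplished_rate": round(sum(r.accomplished_goal for r in responses) / len(responses), 2),
        "top_blockers": Counter(r.blocker.strip().lower() for r in responses).most_common(3),
    }
```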
The Transparency Requirement
One principle cuts through most of the dysfunction: show people what's being measured and why. If you're tracking metrics, make the dashboard visible to the engineers themselves. Let them see the data that's being used to evaluate their work. This does two things.
First, it enables self-correction. If I can see that my PRs are sitting unreviewed for four days on average, I can ask why. Is it a review queue problem? Am I writing PRs that are too large? Is my work in a knowledge silo? The metric becomes a diagnostic tool rather than a judgment.
Second, it forces accountability upward. If management is making decisions based on flawed data and everyone can see the data, the flaw becomes discussable. You can have the conversation about what the number actually means, what it's missing, whether it should be weighted differently. Opacity breeds distrust. Transparency creates the possibility—not the guarantee, but the possibility—of alignment.
Monday Morning
If I were walking into a team suffering under bad metrics, here's what I'd do first:
Audit what's being measured and why. Trace each metric back to a decision it's supposed to inform. If you can't articulate the decision, stop collecting the metric. Data has a cost—attention, cognitive load, the pressure of being watched. Pay that cost only for signals that matter.
Kill individual velocity tracking immediately. Aggregate at the team level if you must, but even then, use it as a planning heuristic, not a performance lever. The moment you tie it to reviews or compensation, it becomes toxic.
Instrument the feedback loops. How long from commit to production? How long from "I need help" to "I got an answer"? How long from "CI failed" to "I understand why"? Those durations are where your leverage lives; the sketch after this list shows one way to pull them out of event streams you already have.
Run a developer satisfaction survey. Two questions, weekly: "What blocked you this week?" and "What would you fix if you could fix one thing?" Aggregate the answers. Fix the top-voted issues. Visibly. This builds trust and actually addresses constraints.
And accept that some things can't be quantified usefully. Code quality, architectural coherence, team knowledge transfer—these are real, they matter, but they resist reduction to dashboards. You have to evaluate them through code review, through conversation, through the judgment of senior practitioners who've built enough systems to recognize health and decay. That's slower and harder than checking a metric. It's also accurate.
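For the feedback-loop durations mentioned above, one lightweight approach is to treat everything as timestamped events and measure the gap between matched pairs. This is a sketch under that assumption; the event kinds are invented for illustration, and the point is the shape of the measurement, not the specific names.

```python
from datetime import datetime
from statistics import median

def loop_hours(events: list[tuple[str, str, datetime]], start_kind: str, end_kind: str):
    """Median hours between paired events, matched on a shared key.

    `events` are (kind, key, timestamp) tuples pulled from systems you already have
    (git, CI, chat). The event kind names used below are purely illustrative.
    """
    starts: dict[str, datetime] = {}
    gaps = []
    for kind, key, ts in sorted(events, key=lambda e: e[2]):
        if kind == start_kind:
            starts.setdefault(key, ts)          # remember the first start per key
        elif kind == end_kind and key in starts:
            gaps.append((ts - starts.pop(key)).total_seconds() / 3600)
    return round(median(gaps), 1) if gaps else None

# The same function answers all three questions:
# loop_hours(events, "commit", "deployed")             # commit -> production
# loop_hours(events, "help_asked", "help_answered")    # "I need help" -> "I got an answer"
# loop_hours(events, "ci_failed", "ci_green")          # CI failed -> back to green
```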
The Underlying Dysfunction
The obsession with developer productivity metrics often stems from a deeper problem: management that doesn't trust its engineers. The dashboard becomes a substitute for relationship, for context, for the messy work of actually understanding what's happening. If you trust your team, you don't need to count their commits. You have conversations. You look at what's shipping. You ask where they're stuck. The numbers can inform those conversations, but they can't replace them.
Metrics are seductive because they promise certainty. They collapse the irreducible complexity of software work into a single number that trends up or down. But that certainty is false. The number is not the territory. The territory is a team of humans trying to build something durable in a system with a hundred moving parts, most of which are on fire at any given moment. You cannot dashboard your way to understanding that. You have to be in it.
So use metrics, but hold them lightly. Combine quantitative signals with qualitative judgment. Measure systems, not people. Optimize for flow and feedback loops, not output volume. And when the dashboard says one thing and the humans say another, believe the humans. They're the ones actually building the software.
The rest is just telemetry theater.