AI-Assisted Code Review: What Actually Works in Practice
"Just add an AI reviewer to your PRs, and you'll catch more bugs with less effort."
It's a nice story. Your pull request gets a free extra reviewer who never sleeps, points out every bug, and saves your senior engineers from yet another round of "did we validate input here?" Real life is messier.
I've been working with code review tooling for years. I maintain an ESLint plugin. I've configured PR bots, watched teams adopt them with excitement, and watched those same teams mute them within weeks. So when someone tells me AI code review "just works," I want to know: works how? For whom? Measured against what?
That's what this piece is about. Not a product comparison. Not a vendor ranking. A look at what the data says, what I've seen break, and what pattern actually holds up when you deploy it on a real codebase with real deadlines.
The checkbox problem
If you're shipping software in 2026, AI is in your workflow whether you choose it or not. Stack Overflow's 2025 survey puts it at 84% of respondents using or planning to use AI tools in development. 47.1% use them daily.[^1]
Daily. That changes things.
When a tool is always in the IDE, sliding it into PRs feels like the obvious next move. PR comments look like governance. They're visible, auditable, and attached to a specific diff. Managers point to a bot and say, "We improved review coverage." Security says, "We're doing AI-assisted review." Delivery says, "Cool, faster merges." Everyone gets a checkbox.
Nobody asks whether the checkbox caught anything.
I shipped a PR-bot config last year that looked great in demos and got muted in a week. It spammed the same nits on every change. Nobody held a meeting to kill it. People just stopped reading. That's the first disappointment: teams treat "AI review" like a new teammate, but teammates have attention budgets. Bots don't. A bot that eats attention without returning value becomes background noise faster than a flaky test.
The second disappointment is worse, actually. AI noise lands at the worst possible moment, when someone is already doing the hard work of understanding a change. If your PR flow is healthy, humans spend their review time on intent, invariants, and edge cases. Dropping probabilistic comments into that stream? It's like handing them a second diff, except this one is a list of claims that might be wrong.
What "working" actually looks like (hint: it's boring)
Forget the bot UI for a second. Most teams don't say "we need more comments." They say "we need fewer false alarms and fewer escaped issues."
That's a precision and recall problem. Not a prompt engineering problem.
Here's a useful anchor. AdaTaint, a 2025 research prototype, uses an LLM to adapt source and sink identification in taint analysis. The point isn't "add AI to everything." The point is: use the LLM to make deterministic analysis smarter. Their evaluation showed a 43.7% average false-positive reduction versus CodeQL, Joern, and an LLM-only pipeline, while improving recall by 11.2%.[^2]
Both directions at once. That's rare.
Usually, when someone says "we reduced false positives," they did it by hiding problems. Crank down sensitivity, sure, fewer alerts. Also, fewer catches. And when someone says "we improved detection," they usually mean more noise too. AdaTaint claims movement on both fronts. Whether it replicates at scale is an open question, but the direction matters.
Translate that to code review. The win isn't "the LLM left a clever comment." The win is a system-level change in where the detection needle sits. Fewer useless alerts in the queue. Fewer real issues slipping through. Same humans making the call.
I keep coming back to false positives because they're not just annoying. They're expensive. Every false alarm costs a context switch, a triage decision, and a second look at a PR. And trust. Once engineers learn that a bot is wrong a lot? They stop reading it. At that point, your true positives are wasted too.
The uncomfortable part about LLM-only review
Here's the thing nobody wants to say out loud: if your AI can't reliably write correct code, it can't reliably review code either.
Not because review is inherently harder (though sometimes it is). Because ungrounded review comments are still model outputs. Sometimes insightful. Sometimes confident nonsense.
A 2023 study ran code generation benchmarks on HumanEval. Correctness rates: 65.2% for ChatGPT, 46.3% for GitHub Copilot, 31.1% for Amazon CodeWhisperer.[^3] HumanEval isn't your codebase. But it's a standardized way to say: even on small, well-defined tasks with clear tests, a big chunk of what these models produce is wrong.
That doesn't make AI useless in review. It means you treat it like a probabilistic assistant, not a gate.
The reliable loop looks like this: deterministic checks and tests tell you what's failing. The model helps you understand why and fix it faster. The unreliable loop is the reverse: the model speculates about what might be wrong, and you spend time proving it isn't.
And here's the failure mode people miss. It's not just "missed bug." The more expensive failure is a bug-shaped comment that burns reviewer time. Bot says "possible race condition." Now the reviewer has to load the concurrency model in their head, scan the diff, and decide if the comment means anything. If the bot is wrong, the team just paid for a deep review pass that wouldn't have happened otherwise. If it's right, you still need a test to prove it.
I'll almost contradict myself here. Sometimes a chatty LLM-only reviewer is useful. It can nudge someone into writing a missing test, catching a naming mismatch, or cleaning up a confusing branch. The problem isn't that it never helps. It's that you can't predict when it helps, and you can't budget for when it doesn't. Unless you constrain it.
Constraint. That's the whole difference between "assist" and "interrupt."
Security is where hype goes to die
Security teams have to explain failures in writing. If an LLM misses a vulnerability, "well, it seemed smart" doesn't go in the incident report.
A 2025 large-scale analysis ran CodeQL on AI-attributed code in public GitHub repos. 7,703 files analyzed. 4,241 CWE instances found.[^4]
Two things jump out.
87.9% of files had no identifiable CWE vulnerabilities. Good. "AI code is insecure" as a blanket statement is lazy. Most of it was fine under that lens.
But when vulnerabilities showed up, they clustered by language. Python: 16.18% to 18.50%. JavaScript: 8.66% to 8.99%. TypeScript: 2.50% to 7.14%.[^4] Dynamic languages plus fast-moving glue code plus "it works" patches that are undertested and underspecified. Sound familiar?
The takeaway isn't "use CodeQL and fire your reviewers." It's that deterministic scanners still do the repeatable part best. Finding CWE-shaped patterns at scale. That's what they're built for.
LLMs shine after the scanner fires. They can explain why a CodeQL alert matters in your repo's context. Propose a minimal patch. Suggest a test that makes the exploit path harder to reintroduce. They can cluster duplicates, which is really just a fancy way of saying "stop telling me about the same root cause in twenty files."
Static analysis finds. It doesn't teach. Humans teach, but it's expensive. LLMs can bridge that gap when they're grounded in what the scanner already proved.
Quick tangent that might sound unrelated. The healthiest teams I've worked with treat formatting as a solved problem. They run Prettier. They stop arguing about brace style. That cultural choice matters here because it reveals the real goal: fewer recurring debates, not more opinions. The same principle applies to security alerts. If the LLM is a new source of opinions, it increases debate. If it turns scanner output into actionable work, it reduces debate.
The tool landscape is shakier than you think
Most people asking "which AI code review tool should we use?" are really asking two different things. Which vendor will my org approve without a procurement war? And which product has a review experience that doesn't get muted?
We don't have great head-to-head precision and recall datasets for PR-bot review tools. That's not a dodge. It's a real gap. Vendors demo on curated diffs. Teams deploy on messy repos where half the bugs live in the interaction between old code and new code.
So adoption data becomes a rough proxy. Stack Overflow's 2023 survey: 54.77% reported using GitHub Copilot regularly. 5.14% for AWS CodeWhisperer. 1.25% for Codeium.[^5] That doesn't prove Copilot reviews best. It predicts which tool you can roll out without a fight and which one your developers already recognize.
Here's the part teams underestimate. Review integrations can disappear.
CodiumAI's PR Agent had 655 GitHub Marketplace installs and announced it would sunset the hosted version on December 1, 2025.[^6] Maybe you never used it. The specific product matters less than the lesson: if your review workflow depends on a hosted integration, plan for churn.
I get skeptical when teams build processes around the bot UI itself. CodeRabbit, Copilot's review features, Codeium, and Codium-style agents. They're treated as the product. In practice, they're a surface. A way to place findings into PRs.
Evaluate them like a surface, and the questions get concrete. Can it ingest deterministic outputs in a structured format like SARIF? Can it comment only when asked, or only once per PR? Can you export the config? Run it self-hosted?
Not flashy questions. They're the questions that determine whether your "AI reviewer" survives past the first wave of muting.
A pipeline that actually works (it's not exciting)
The best setup I've seen is boring and a little strict. Deterministic checks first. LLM downstream, turning verified signals into useful action.
You don't have to change your team's values about review. You're not replacing human judgment. You're trying to keep it focused on the parts that need it.
Locally, at the IDE. Formatting, linting, and type checking. Fast. This is the cheapest place to catch issues because the developer still has context loaded. Want an LLM here? Keep it in "explain and unblock" mode. "Why is this type error happening?" "What does this ESLint rule mean?" That's an assist. It reduces frustration without polluting the PR.
In CI. Tests and static analysis are the source of truth. ESLint and TypeScript (or mypy, or javac) are still unbeatable at certain correctness classes because they're deterministic and specific. For security: Semgrep, CodeQL, SonarQube. They're more useful as gates than LLM comments because you can tune them, baseline them, and audit them. Machine-readable output (SARIF) gives you a stable interface between "finding" and "presenting."
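That stable interface is worth making concrete. Below is a minimal TypeScript sketch of the consuming side, assuming the standard runs/results layout from the SARIF 2.1.0 spec; the `Finding` shape is my own, not part of any tool's API:

```typescript
// Extract (ruleId, file, line, message) tuples from a SARIF log.
// Field names (ruleId, message.text, physicalLocation, etc.) follow
// the SARIF 2.1.0 spec; the Finding type is an illustrative shape.

interface SarifLog {
  runs: {
    results?: {
      ruleId?: string;
      message: { text: string };
      locations?: {
        physicalLocation?: {
          artifactLocation?: { uri?: string };
          region?: { startLine?: number };
        };
      }[];
    }[];
  }[];
}

interface Finding {
  ruleId: string;
  message: string;
  file: string;
  line: number;
}

function extractFindings(log: SarifLog): Finding[] {
  const findings: Finding[] = [];
  for (const run of log.runs) {
    for (const result of run.results ?? []) {
      // SARIF allows multiple locations per result; take the primary one.
      const loc = result.locations?.[0]?.physicalLocation;
      findings.push({
        ruleId: result.ruleId ?? "unknown",
        message: result.message.text,
        file: loc?.artifactLocation?.uri ?? "<unknown>",
        line: loc?.region?.startLine ?? 0,
      });
    }
  }
  return findings;
}
```

Anything downstream, a bot comment, a dashboard, an LLM prompt, reads from this one normalized shape instead of scraping each scanner's native output.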
Then the LLM, but only then. Give it the diff, scanner results, and failing tests. Three jobs, kept tight:
- Summarize confirmed findings in the repo context. Not "SQL injection is bad." Something like "this handler concatenates untrusted input into a query string, and our DB layer doesn't parameterize."
- Propose a patch and a minimal test. The test is the enforcement mechanism that keeps an AI-suggested fix honest.
- Cluster duplicates. One Semgrep rule fires in five places? One comment, not five.
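Of those three jobs, clustering is the most mechanical, and you can sketch it without any model at all: group by rule ID, emit one summary line per cluster. The `Finding` shape here is illustrative:

```typescript
// Collapse repeated findings of the same rule into one cluster, so five
// hits for one Semgrep rule become one comment with five locations.

interface Finding {
  ruleId: string;
  file: string;
  line: number;
}

function clusterByRule(findings: Finding[]): Map<string, Finding[]> {
  const clusters = new Map<string, Finding[]>();
  for (const f of findings) {
    const bucket = clusters.get(f.ruleId) ?? [];
    bucket.push(f);
    clusters.set(f.ruleId, bucket);
  }
  return clusters;
}

function summarizeCluster(ruleId: string, hits: Finding[]): string {
  const locations = hits.map((h) => `${h.file}:${h.line}`).join(", ");
  return `${ruleId} fired ${hits.length}x: ${locations}`;
}
```

The LLM's job starts after this: explaining why the clustered rule matters in this repo, not enumerating the hits.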
What's missing from that list: open-ended "review the whole PR and tell me what you think." That prompt produces prose, not predictable value.
At the PR surface. This is where most teams go wrong because they maximize comments, thinking it means coverage. If you want your team to actually read the output, set a noise budget. Written down. Enforced. A single summary comment per PR, plus on-demand drill-down. Lots of tools support a /ai-review command. Use that. Engineers are way more tolerant of AI feedback when they ask for it.
Ban categories of comments you already have tools for. Prettier runs? Bot doesn't talk about formatting. ESLint rule exists? Bot doesn't relitigate it in natural language. Sounds harsh. It's how you keep attention where it counts.
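A noise budget is enforceable in code, which is the point: written down and mechanical, not a vibe. A sketch, with hypothetical category labels, that drops findings in categories already owned by deterministic tools and emits at most one summary comment per PR:

```typescript
// Noise budget sketch: filter out banned categories, then produce either
// one summary comment or silence. Category names are illustrative labels,
// not any real tool's taxonomy.

interface BotFinding {
  category: string; // e.g. "formatting", "lint", "security"
  text: string;
}

// Prettier and ESLint already own these; the bot never mentions them.
const BANNED = new Set(["formatting", "lint"]);

function buildSummaryComment(findings: BotFinding[]): string | null {
  const kept = findings.filter((f) => !BANNED.has(f.category));
  if (kept.length === 0) return null; // nothing worth a comment: stay silent
  const lines = kept.map((f) => `- [${f.category}] ${f.text}`);
  return `AI review summary (${kept.length} findings):\n${lines.join("\n")}`;
}
```

Returning `null` matters as much as the summary: an empty PR thread is the bot telling you the deterministic layer already handled everything.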
A composite story, anonymized. Mid-size platform team rolls out a PR bot with inline comments on every PR. Catches a couple issues early, the kind that make good screenshots. Two weeks later, senior engineers have browser extensions to collapse the bot thread by default.
What made it useful again wasn't a better model. They limited the bot to SARIF-backed findings and one summary. Required that any suggested fix include a test or reproduction. The bot stopped being a commentator. Started being a triage assistant.
Who owns what
Static analysis owns detection in the classes it can model. Humans own intent and architecture. LLMs own triage and translation, as long as they're grounded in evidence.
That's the division of labor. Not philosophical. Operational.
Static analysis wins when you need determinism, repeatability, and auditability. CWE detection. Formatting. Lint rules. Type correctness. Dependency checks. Not glamorous. Stable. Scales across a monorepo in ways a conversational reviewer never will.
LLMs add value where determinism is the wrong tool. Explaining alerts in human terms that match your codebase. Drafting patches that respect local patterns. Reducing cognitive load by clustering and summarizing scanner output.
But be explicit about what you never hand off to an LLM alone. Don't use an LLM-only bot as a security sign-off. Don't run always-on nitpicking that fights your formatter. Don't let a suggestion merge without tests.
That last one is personal. I let an LLM suggest a "fix" once. It looked plausible, it compiled, and it was wrong in a way that only showed up under a specific production data shape. Tests saved us. Correctness isn't a vibe. It's an invariant you enforce.
The HumanEval numbers make this sharp. Best performer in that 2023 study, ChatGPT at 65.2% correctness.[^3] Copilot at 46.3%. CodeWhisperer at 31.1%. A big fraction of AI-produced code needs correction. If the code is that probabilistic, review commentary from the same source is probabilistic too.
"But isn't this just more tooling and more cost?"
Fair objection. The PR path is already a bottleneck. Adding CodeQL, Semgrep, Sonar, plus LLM tokens, plus integration work, that's the opposite of "simpler."
And AdaTaint is a paper, not a product. Static analysis is famously noisy. These are real concerns.
The only honest way forward is narrow. Start with checks you already trust. Cap the noise to one PR summary. Run a time-boxed pilot with hard metrics: review time, escaped defects, muted alerts. If it can't prove value fast, don't scale it.
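The pilot's pass/fail bar is worth writing down as code too, so it can't soften later. The metric names and the 50% muted-alert cutoff below are illustrative assumptions, not benchmarks from any study:

```typescript
// Sketch: decide whether a time-boxed pilot proved value against hard
// thresholds. All field names and the 0.5 cutoff are hypothetical.

interface PilotMetrics {
  medianReviewHours: number; // time from PR open to approval
  escapedDefects: number;    // issues found after merge, per period
  mutedAlertRate: number;    // fraction of bot comments dismissed unread
}

function pilotPassed(before: PilotMetrics, after: PilotMetrics): boolean {
  return (
    after.medianReviewHours <= before.medianReviewHours &&
    after.escapedDefects <= before.escapedDefects &&
    after.mutedAlertRate < 0.5 // illustrative: most comments must be read
  );
}
```

If the function returns false at the end of the pilot window, you don't scale. No renegotiating the thresholds after the fact.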
Let static analysis tell you what it can prove. Let tests tell you what actually runs. Let the LLM translate those signals into something a human can act on quickly.
Keep humans on the parts no tool can infer. Product intent. Failure modes. Whether this change makes the system easier or harder to evolve.
References
[^1]: Stack Overflow. Developer Survey 2025. https://survey.stackoverflow.co/2025
[^2]: AdaTaint: LLM-Adapted Taint Analysis for False Positive Reduction. arXiv:2511.04023, 2025. https://arxiv.org/abs/2511.04023
[^3]: Yetiştiren, B., et al. Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv:2304.10778, 2023. https://arxiv.org/abs/2304.10778
[^4]: CWE Detection in AI-Generated Code on GitHub. arXiv:2510.26103, 2025. https://arxiv.org/abs/2510.26103
[^5]: Stack Overflow. Developer Survey 2023. https://survey.stackoverflow.co/2023/
[^6]: CodiumAI PR Agent. GitHub Marketplace. https://github.com/marketplace/codiumai-pr-agent