Generative AI has solved the "blank canvas" problem, but it has introduced a new one: Unoptimized Bloat.


LLM-based coding agents have revolutionized the speed of delivery of new software. However, an over-reliance on these agents has led to a spike in unoptimized code. In software engineering, high performance is usually the result of deep analysis, research, and experimentation, steps that are often skipped when an agent is rushing to close a ticket.


Slow software has real costs: increased user latency, prolonged job processing times, and inflated compute bills. For systems running at scale, performance cannot be an afterthought.


At Codeflash, we have spent over two years building an "AI Performance Engineer": an agentic workflow designed to automatically optimize both existing and newly written code. This lets developers use coding agents to ship new code quickly while keeping it performant.


In this article, we break down the architecture of a general-purpose optimizer capable of accelerating almost any Python code.


The High-Level Optimization Framework

Optimizing a massive codebase is daunting. Complex end-to-end flows involve thousands of codepaths. To tackle this, we treat the function as the atomic unit of optimization. Functions provide a natural abstraction layer that allows us to isolate logic and verify improvements rigorously.


We use a dual-verification framework to determine if a change is a "true" optimization:

  1. Correctness (The Regression Check): The optimized code must behave exactly like the original. We verify this by running existing unit tests and generating new, diverse regression test cases via LLMs to cover edge cases. We instrument these test cases to rigorously verify that behavior remains unchanged.
  2. Performance Gain (The Benchmark): The new code must be statistically faster. We run comparative benchmarks to ensure the speedup is real and not just noise.
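

To make this concrete, here is a minimal sketch of such a dual-verification loop. The function names, test inputs, and the 5% speedup threshold are illustrative placeholders, not our production harness, which also checks exceptions, mutated arguments, and other side effects.

```python
import time


def verify_candidate(original_fn, optimized_fn, test_inputs, runs=25):
    """Accept a candidate only if it is behavior-preserving AND measurably faster."""
    # 1. Correctness (the regression check): outputs must match on every input.
    for args in test_inputs:
        if original_fn(*args) != optimized_fn(*args):
            return False, 1.0

    # 2. Performance gain (the benchmark): compare best-of-N wall-clock times.
    def best_time(fn):
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            for args in test_inputs:
                fn(*args)
            samples.append(time.perf_counter() - start)
        return min(samples)  # min-of-N is fairly robust to scheduling noise

    speedup = best_time(original_fn) / best_time(optimized_fn)
    return speedup > 1.05, speedup  # demand a margin above measurement noise


# Toy example: a quadratic pair-sum check vs. a set-based rewrite.
def has_pair_slow(nums, target):
    return any(a + b == target for i, a in enumerate(nums) for b in nums[i + 1:])


def has_pair_fast(nums, target):
    seen = set()
    for n in nums:
        if target - n in seen:
            return True
        seen.add(n)
    return False


inputs = [(list(range(500)), 997), (list(range(500)), 10_000)]
print(verify_candidate(has_pair_slow, has_pair_fast, inputs))
```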


Once we have a framework to verify optimization, the challenge shifts to discovery: How do we automatically find the best version of a given function?


Context is King

For a coding agent, context isn't just helpful; it is the raw material of intelligence. When deciding what context to feed to an LLM, we ask: "Does this piece of information help a human expert understand and optimize this code?"

We divide context into two buckets: Code Context (Static) and Runtime Context (Dynamic).

1. Code Context (Static Analysis)

This helps the model understand the structure and dependencies of the code.



Key Learning: We strictly limit edits to the code path being executed. This minimizes "spurious changes": stylistic edits that add noise to the code diff without improving performance.
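

As a rough illustration of how static code context can be assembled, the sketch below uses Python's built-in ast module to pull out a target function together with the module-level helpers it calls, so only the relevant code path is shown to the model. The module source and helper names are invented for the example; this is not Codeflash's actual context builder.

```python
import ast
import textwrap

# A stand-in module; in practice this is read from the user's repository.
MODULE_SOURCE = textwrap.dedent("""
    def _normalize(row):
        total = sum(row)
        return [x / total for x in row]

    def unrelated_helper():
        return 42

    def score_rows(rows):
        return [max(_normalize(r)) for r in rows]
""")


def build_code_context(source: str, target: str) -> str:
    """Return the target function plus the module-level helpers it calls
    (one level deep, for brevity), so only the executed code path is sent."""
    tree = ast.parse(source)
    functions = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    # Names called inside the target function's body.
    called = {
        node.func.id
        for node in ast.walk(functions[target])
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }

    keep = [target] + sorted(name for name in called if name in functions and name != target)
    return "\n\n".join(ast.get_source_segment(source, functions[name]) for name in keep)


print(build_code_context(MODULE_SOURCE, "score_rows"))
# Prints score_rows and _normalize, but not unrelated_helper.
```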

2. Runtime Context (Dynamic Analysis)

Static code analysis is often blind to bottlenecks. You cannot always tell which line is slow just by reading it. This is where runtime data is critical.
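

A runtime profile makes the hot path explicit in a way the source alone cannot. Below is a minimal sketch using the standard-library cProfile; the workload is invented for illustration, and a production setup would collect richer line-level timings.

```python
import cProfile
import io
import pstats


def tokenize(text):
    return text.split()


def count_words(docs):
    counts = {}
    for doc in docs:
        for tok in tokenize(doc):
            counts[tok] = counts.get(tok, 0) + 1
    return counts


docs = ["the quick brown fox jumps over the lazy dog"] * 20_000

profiler = cProfile.Profile()
profiler.enable()
count_words(docs)
profiler.disable()

# Render the hottest call sites; this text is exactly the kind of runtime
# context that can be attached to the prompt alongside the source code.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```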


Searching for the Optimal Code

Here is the hard truth: Zero-shot optimization rarely works.


In our internal benchmarks over open-source codebases, asking GPT-4o to "optimize this function" failed 90% of the time.[1]



Optimization is a search problem. Even if you find a faster version, is it the fastest version? Likely not. To solve this, we moved from a "prompting" strategy to an "agentic search" strategy.


Here are some strategies that work well for efficiently finding optimizations:

Strategy 1: Stochastic Sampling

LLMs are probabilistic. Instead of asking once, we ask multiple times (with high temperature) to generate a wide distribution of ideas. By using different models and prompting for diversity, we explore the solution space broadly. We then apply semantic de-duplication to filter out identical logic before benchmarking.
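

A minimal sketch of this sampling-and-deduplication step is below. The call_llm function is a hypothetical placeholder for whatever model client you use, and the deduplication key is an AST dump, which treats formatting-only differences as the same candidate.

```python
import ast


def call_llm(prompt: str, temperature: float) -> str:
    """Hypothetical model client; swap in your actual LLM API call."""
    raise NotImplementedError


def normalize(code: str) -> str:
    # The AST dump ignores whitespace, comments, and formatting-only differences.
    return ast.dump(ast.parse(code))


def sample_candidates(function_source: str, n_samples: int = 10) -> list[str]:
    prompt = (
        "Rewrite this function to run faster while preserving its behavior:\n"
        f"{function_source}"
    )
    seen, candidates = set(), []
    for _ in range(n_samples):
        code = call_llm(prompt, temperature=0.9)  # high temperature -> diverse ideas
        try:
            key = normalize(code)
        except SyntaxError:
            continue  # discard samples that are not valid Python
        if key not in seen:
            seen.add(key)
            candidates.append(code)
    return candidates
```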

Strategy 2: Optimization Refinement

When optimizing a piece of code, the LLM might combine three optimization ideas: two that speed up the code and one that slows it down.


We use a refinement step to prune the bad ideas. We diff the line-profiling information of the original and the optimized version to isolate the specific lines contributing to the speedup. Reverting the changes that don't help performance yields a crisp, minimal diff that is easier for a human to review while retaining the maximal performance gain.
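

One simple way to approximate this pruning step is a leave-one-out ablation over the diff hunks that make up the optimized version: drop each change in turn and keep it out if performance does not get worse. Codeflash drives this with line-profile diffs, but the sketch below conveys the idea; apply_hunks and benchmark are hypothetical helpers for patch application and timing.

```python
def refine(original_code: str, hunks: list, apply_hunks, benchmark) -> str:
    """Keep only the diff hunks that actually pay for themselves.

    `hunks` is the list of individual changes that make up the optimized
    version; `apply_hunks(code, subset)` and `benchmark(code)` are hypothetical
    helpers standing in for patch application and wall-clock measurement.
    """
    kept = list(hunks)
    for hunk in hunks:
        without = [h for h in kept if h is not hunk]
        # If dropping the hunk is at least as fast, revert it: smaller diff, same speed.
        if benchmark(apply_hunks(original_code, without)) <= benchmark(apply_hunks(original_code, kept)):
            kept = without
    refined = apply_hunks(original_code, kept)
    # Never ship a change that is not faster than the original.
    return refined if benchmark(refined) < benchmark(original_code) else original_code
```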

Strategy 3: Self-Repair (The Feedback Loop)

Aggressive optimizations can often break edge cases. If a test fails, we don't immediately discard the attempt. Instead, we feed the error traceback to the agent:


"Your optimization failed test case X with error Y. Fix the bug while maintaining the performance improvement."


This allows us to salvage potentially brilliant algorithmic ideas that just had minor implementation errors.
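

A minimal sketch of this repair loop, with call_llm and run_tests as hypothetical placeholders for the model client and the test harness:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical model client; swap in your actual LLM API call."""
    raise NotImplementedError


def run_tests(code: str):
    """Hypothetical test runner: returns None on success, else a traceback string."""
    raise NotImplementedError


def optimize_with_repair(candidate: str, max_repairs: int = 2):
    """Give a failing candidate a bounded number of chances to fix itself."""
    for _ in range(max_repairs + 1):
        failure = run_tests(candidate)
        if failure is None:
            return candidate  # tests pass; this candidate moves on to benchmarking
        candidate = call_llm(
            "Your optimization failed with this traceback:\n"
            f"{failure}\n"
            "Fix the bug while maintaining the performance improvement.\n"
            f"Current code:\n{candidate}"
        )
    return None  # could not salvage this idea within the repair budget
```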

Strategy 4: Deep Search (AlphaEvolve Style)

For critical bottlenecks, we employ a "deep search" method inspired by Google DeepMind's work [2]. We treat optimization attempts as a trajectory. We ask the LLM to critique its own previous attempts and improve on them, creating a chain of thought that evolves over time. This approach has allowed us to discover novel algorithmic improvements that a single-shot prompt would never uncover.
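

A heavily simplified sketch of such a trajectory-based search is below; call_llm, the score function (e.g., measured speedup, zero if tests fail), and the prompt format are illustrative assumptions, not the AlphaEvolve or Codeflash implementation.

```python
def deep_search(original_code: str, call_llm, score, iterations: int = 8) -> str:
    """Evolve one optimization trajectory: each step sees every previous
    attempt and its measured score, critiques them, and proposes a better
    version. `call_llm` and `score` are hypothetical placeholders."""
    history = [(original_code, score(original_code))]
    best_code, best_score = history[0]
    for _ in range(iterations):
        transcript = "\n\n".join(
            f"# attempt with speedup {s:.2f}x\n{code}" for code, s in history
        )
        candidate = call_llm(
            "Here are previous optimization attempts and their measured speedups:\n"
            f"{transcript}\n"
            "Critique them, then write a faster version that is still correct."
        )
        s = score(candidate)  # e.g., 0.0 if tests fail, else the measured speedup
        history.append((candidate, s))
        if s > best_score:
            best_code, best_score = candidate, s
    return best_code
```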


Summary

Automating performance engineering is the next frontier for code generation. It requires moving beyond simple "code generation" to a rigorous loop of Profiling → Generation → Verification → Refinement → Search.


By implementing these strategies—collecting deep runtime context and treating optimization as a search problem—Codeflash is now able to autonomously optimize complex Python code, from Machine Learning pipelines to heavy numerical algorithms.


The future of coding isn't just agents that write code; it's a software workflow with agents that treat code performance as a first-class citizen, and ensure all software is always optimal.


References:

  1. Misra, Saurabh. “LLMs Struggle to Write Performant Code.” Codeflash, 25 May 2025, www.codeflash.ai/blog-posts/llms-struggle-to-write-performant-code.
  2. Novikov, Alexander, et al. “AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery.” arXiv.org, 16 June 2025, arxiv.org/abs/2506.13131.