Generative AI has solved the "blank canvas" problem, but it has introduced a new one: Unoptimized Bloat.


LLM-based coding agents have revolutionized the speed of delivery of new software. However, an over-reliance on these agents has led to a spike in unoptimized code. In software engineering, high performance is usually the result of deep analysis, research, and experimentation, steps that are often skipped when an agent is rushing to close a ticket.


Slow software has real costs: increased user latency, prolonged job processing times, and inflated compute bills. For systems running at scale, performance cannot be an afterthought.


At Codeflash, we have spent over two years building an "AI Performance Engineer": an agentic workflow designed to automatically optimize both existing and newly written code. This lets developers use coding agents to ship new code quickly while keeping it performant.


In this article, we break down the architecture of a general-purpose optimizer capable of accelerating almost any Python code.


The High-Level Optimization Framework

Optimizing a massive codebase is daunting. Complex end-to-end flows involve thousands of codepaths. To tackle this, we treat the function as the atomic unit of optimization. Functions provide a natural abstraction layer that allows us to isolate logic and verify improvements rigorously.


We use a dual-verification framework to determine if a change is a "true" optimization:

  1. Correctness (The Regression Check): The optimized code must behave exactly like the original. We verify this by running existing unit tests and generating new, diverse regression test cases via LLMs to cover edge cases. We instrument these test cases to rigorously verify that behavior remains unchanged.
  2. Performance Gain (The Benchmark): The new code must be statistically faster. We run comparative benchmarks to ensure the speedup is real and not just noise.
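

To make this concrete, here is a minimal sketch of such a dual-verification loop. The function names, test inputs, and the 5% speedup threshold are illustrative placeholders, not our production harness, which also checks exceptions, mutated arguments, and other side effects.

```python
import time


def verify_candidate(original_fn, optimized_fn, test_inputs, runs=25):
    """Accept a candidate only if it is behavior-preserving AND measurably faster."""
    # 1. Correctness (the regression check): outputs must match on every input.
    for args in test_inputs:
        if original_fn(*args) != optimized_fn(*args):
            return False, 1.0

    # 2. Performance gain (the benchmark): compare best-of-N wall-clock times.
    def best_time(fn):
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            for args in test_inputs:
                fn(*args)
            samples.append(time.perf_counter() - start)
        return min(samples)  # min-of-N is fairly robust to scheduling noise

    speedup = best_time(original_fn) / best_time(optimized_fn)
    return speedup > 1.05, speedup  # demand a margin above measurement noise


# Toy example: a quadratic pair-sum check vs. a set-based rewrite.
def has_pair_slow(nums, target):
    return any(a + b == target for i, a in enumerate(nums) for b in nums[i + 1:])


def has_pair_fast(nums, target):
    seen = set()
    for n in nums:
        if target - n in seen:
            return True
        seen.add(n)
    return False


inputs = [(list(range(500)), 997), (list(range(500)), 10_000)]
print(verify_candidate(has_pair_slow, has_pair_fast, inputs))
```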


Once we have a framework to verify optimization, the challenge shifts to discovery: How do we automatically find the best version of a given function?


Context is King

For a coding agent, context isn't just helpful; it is the raw material of intelligence. When deciding what context to feed to an LLM, we ask: "Does this piece of information help a human expert understand and optimize this code?"

We divide context into two buckets: Code Context (Static) and Runtime Context (Dynamic).

1. Code Context (Static Analysis)

This helps the model understand the structure and dependencies of the code.



Key Learning: We strictly limit edits to the code path being executed. This minimizes "spurious changes": stylistic edits that add noise to the code diff without improving performance.
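

As a rough illustration of how static code context can be assembled, the sketch below uses Python's built-in ast module to pull out a target function together with the module-level helpers it calls, so only the relevant code path is shown to the model. The module source and helper names are invented for the example; this is not Codeflash's actual context builder.

```python
import ast
import textwrap

# A stand-in module; in practice this is read from the user's repository.
MODULE_SOURCE = textwrap.dedent("""
    def _normalize(row):
        total = sum(row)
        return [x / total for x in row]

    def unrelated_helper():
        return 42

    def score_rows(rows):
        return [max(_normalize(r)) for r in rows]
""")


def build_code_context(source: str, target: str) -> str:
    """Return the target function plus the module-level helpers it calls
    (one level deep, for brevity), so only the executed code path is sent."""
    tree = ast.parse(source)
    functions = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    # Names called inside the target function's body.
    called = {
        node.func.id
        for node in ast.walk(functions[target])
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }

    keep = [target] + sorted(name for name in called if name in functions and name != target)
    return "\n\n".join(ast.get_source_segment(source, functions[name]) for name in keep)


print(build_code_context(MODULE_SOURCE, "score_rows"))
# Prints score_rows and _normalize, but not unrelated_helper.
```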

2. Runtime Context (Dynamic Analysis)

Static code analysis is often blind to bottlenecks. You cannot always tell which line is slow just by reading it. This is where runtime data is critical.
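

A runtime profile makes the hot path explicit in a way the source alone cannot. Below is a minimal sketch using the standard-library cProfile; the workload is invented for illustration, and a production setup would collect richer line-level timings.

```python
import cProfile
import io
import pstats


def tokenize(text):
    return text.split()


def count_words(docs):
    counts = {}
    for doc in docs:
        for tok in tokenize(doc):
            counts[tok] = counts.get(tok, 0) + 1
    return counts


docs = ["the quick brown fox jumps over the lazy dog"] * 20_000

profiler = cProfile.Profile()
profiler.enable()
count_words(docs)
profiler.disable()

# Render the hottest call sites; this text is exactly the kind of runtime
# context that can be attached to the prompt alongside the source code.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```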


Searching for the Optimal Code

Here is the hard truth: Zero-shot optimization rarely works.


In our internal benchmarks over open-source codebases, asking GPT-4o to "optimize this function" failed 90% of the time.[1]



Optimization is a search problem. Even if you find a faster version, is it the fastest version? Likely not. To solve this, we moved from a "prompting" strategy to an "agentic search" strategy.


Here are some strategies that work well for efficiently finding optimizations:

Strategy 1: Stochastic Sampling

LLMs are probabilistic. Instead of asking once, we ask multiple times (with high temperature) to generate a wide distribution of ideas. By using different models and prompting for diversity, we explore the solution space broadly. We then apply semantic de-duplication to filter out identical logic before benchmarking.
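

A minimal sketch of this sampling-and-deduplication step is below. The call_llm function is a hypothetical placeholder for whatever model client you use, and the deduplication key is an AST dump, which treats formatting-only differences as the same candidate.

```python
import ast


def call_llm(prompt: str, temperature: float) -> str:
    """Hypothetical model client; swap in your actual LLM API call."""
    raise NotImplementedError


def normalize(code: str) -> str:
    # The AST dump ignores whitespace, comments, and formatting-only differences.
    return ast.dump(ast.parse(code))


def sample_candidates(function_source: str, n_samples: int = 10) -> list[str]:
    prompt = (
        "Rewrite this function to run faster while preserving its behavior:\n"
        f"{function_source}"
    )
    seen, candidates = set(), []
    for _ in range(n_samples):
        code = call_llm(prompt, temperature=0.9)  # high temperature -> diverse ideas
        try:
            key = normalize(code)
        except SyntaxError:
            continue  # discard samples that are not valid Python
        if key not in seen:
            seen.add(key)
            candidates.append(code)
    return candidates
```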

Strategy 2: Optimization Refinement

When optimizing a piece of code, the LLM might combine three optimization ideas: two that speed up the code and one that slows it down.


We use a refinement step to prune the bad ideas. We diff the line-profiling information of the original and the optimized version to isolate the specific lines contributing to the speedup. Reverting the changes that don't help performance yields a crisp, minimal diff that is easier for a human to review while retaining the maximal performance gain.
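

One simple way to approximate this pruning step is a leave-one-out ablation over the diff hunks that make up the optimized version: drop each change in turn and keep it out if performance does not get worse. Codeflash drives this with line-profile diffs, but the sketch below conveys the idea; apply_hunks and benchmark are hypothetical helpers for patch application and timing.

```python
def refine(original_code: str, hunks: list, apply_hunks, benchmark) -> str:
    """Keep only the diff hunks that actually pay for themselves.

    `hunks` is the list of individual changes that make up the optimized
    version; `apply_hunks(code, subset)` and `benchmark(code)` are hypothetical
    helpers standing in for patch application and wall-clock measurement.
    """
    kept = list(hunks)
    for hunk in hunks:
        without = [h for h in kept if h is not hunk]
        # If dropping the hunk is at least as fast, revert it: smaller diff, same speed.
        if benchmark(apply_hunks(original_code, without)) <= benchmark(apply_hunks(original_code, kept)):
            kept = without
    refined = apply_hunks(original_code, kept)
    # Never ship a change that is not faster than the original.
    return refined if benchmark(refined) < benchmark(original_code) else original_code
```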

Strategy 3: Self-Repair (The Feedback Loop)

Aggressive optimizations can often break edge cases. If a test fails, we don't immediately discard the attempt. Instead, we feed the error traceback to the agent:


"Your optimization failed test case X with error Y. Fix the bug while maintaining the performance improvement."


This allows us to salvage potentially brilliant algorithmic ideas that just had minor implementation errors.
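

A minimal sketch of this repair loop, with call_llm and run_tests as hypothetical placeholders for the model client and the test harness:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical model client; swap in your actual LLM API call."""
    raise NotImplementedError


def run_tests(code: str):
    """Hypothetical test runner: returns None on success, else a traceback string."""
    raise NotImplementedError


def optimize_with_repair(candidate: str, max_repairs: int = 2):
    """Give a failing candidate a bounded number of chances to fix itself."""
    for _ in range(max_repairs + 1):
        failure = run_tests(candidate)
        if failure is None:
            return candidate  # tests pass; this candidate moves on to benchmarking
        candidate = call_llm(
            "Your optimization failed with this traceback:\n"
            f"{failure}\n"
            "Fix the bug while maintaining the performance improvement.\n"
            f"Current code:\n{candidate}"
        )
    return None  # could not salvage this idea within the repair budget
```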

Strategy 4: Deep Search (AlphaEvolve Style)

For critical bottlenecks, we employ a "deep search" method inspired by Google DeepMind's work [2]. We treat optimization attempts as a trajectory. We ask the LLM to critique its own previous attempts and improve on them, creating a chain of thought that evolves over time. This approach has allowed us to discover novel algorithmic improvements that a single-shot prompt would never uncover.
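

A heavily simplified sketch of such a trajectory-based search is below; call_llm, the score function (e.g., measured speedup, zero if tests fail), and the prompt format are illustrative assumptions, not the AlphaEvolve or Codeflash implementation.

```python
def deep_search(original_code: str, call_llm, score, iterations: int = 8) -> str:
    """Evolve one optimization trajectory: each step sees every previous
    attempt and its measured score, critiques them, and proposes a better
    version. `call_llm` and `score` are hypothetical placeholders."""
    history = [(original_code, score(original_code))]
    best_code, best_score = history[0]
    for _ in range(iterations):
        transcript = "\n\n".join(
            f"# attempt with speedup {s:.2f}x\n{code}" for code, s in history
        )
        candidate = call_llm(
            "Here are previous optimization attempts and their measured speedups:\n"
            f"{transcript}\n"
            "Critique them, then write a faster version that is still correct."
        )
        s = score(candidate)  # e.g., 0.0 if tests fail, else the measured speedup
        history.append((candidate, s))
        if s > best_score:
            best_code, best_score = candidate, s
    return best_code
```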


Summary

Automating performance engineering is the next frontier for code generation. It requires moving beyond simple "code generation" to a rigorous loop of Profiling → Generation → Verification → Refinement → Search.


By implementing these strategies—collecting deep runtime context and treating optimization as a search problem—Codeflash is now able to autonomously optimize complex Python code, from Machine Learning pipelines to heavy numerical algorithms.


The future of coding isn't just agents that write code; it's a software workflow with agents that treat code performance as a first-class citizen, and ensure all software is always optimal.


References:

  1. Misra, Saurabh. “LLMs Struggle to Write Performant Code.” Codeflash, 25 May 2025, www.codeflash.ai/blog-posts/llms-struggle-to-write-performant-code.
  2. Novikov, Alexander, et al. “AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery.” arXiv.org, 16 June 2025, arxiv.org/abs/2506.13131.