Through the centuries of technological progress, benchmarks have served as crucial mechanisms for measuring capability and comparing systems. In the late 18th century, the first dynamometers were built to measure human and animal strength — an early attempt to quantify physical performance.
A century and a half later, the Turing Test arrived as one of the first measures of machine intelligence. And by the late 20th century, benchmarks like SPECint and LINPACK had become the yardsticks of computing performance, defining eras of progress.
Fast-forward to the modern AI age, and a new wave of specialized benchmarks is emerging. For reasoning in codebases, there’s SWE-Bench. For agents operating in command-line environments, there’s Terminal-Bench. And now, for agentic context engineering, there’s Context-Bench.
Going from context to competence
As AI systems become more autonomous — combining tools, retrieving data, and executing plans — the question shifts from what they can do to how they manage information over time. Context-Bench is designed to test that ability: how well models can retain and apply context across long, multi-step tasks.
Context-Bench is the handiwork of the folks at Letta, a generative AI startup that spun out of UC Berkeley’s AI research lab last year with $10 million in funding. More broadly, Letta develops infrastructure for “stateful” agents — systems that can remember, reason, and adapt over repeated interactions. Its platform includes tools for context management, memory orchestration, and long-horizon task execution, aimed at helping developers design agents that learn from experience rather than starting from scratch each time.
With the launch of Context-Bench, Letta adds an empirical backbone to that work, offering a standardized way to test how well systems handle memory, reasoning, and continuity.
Unlike traditional evaluations that score models on isolated problems, Context-Bench examines continuity — whether a model can maintain and reuse information across long tasks, chaining file operations, tracing relationships, and coordinating tool use without losing track of prior steps. The researchers describe it as a way to measure sustained context management rather than short-term recall.
“An agent’s ability to manage its own memory and state (or ‘agentic context engineering’) is key to enabling continual learning,” Letta co-founder and CTO Sarah Wooders said at the Context-Bench launch. “How can we measure context management as a core agentic capability, as we do with coding?”
It’s a question that points to a deeper shift in how AI progress is measured: not just by intelligence, but by continuity.
“Agents running on models that do well on Context-Bench will excel at long-term learning as well as understanding how and when to pull in external information,” Wooders continued.
Measuring what models remember
In a nutshell, Context-Bench tracks how a model performs in an agentic setting — how efficiently it manages memory, how often it revisits prior context, and how much it costs to complete a task.
That cost dimension matters, and it surfaces some of the more interesting findings on the Context-Bench leaderboard. GPT-5, for instance, has lower per-token pricing than Anthropic’s Sonnet 4.5, yet costs more to complete the benchmark because it consumes more tokens overall. The current top performer, Sonnet 4.5, completes about 74 percent of the benchmark — leaving headroom for improvement.
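To make the arithmetic concrete, here is a minimal sketch of how total cost can invert a per-token price advantage. The prices and token counts are purely illustrative, not figures from the leaderboard or from either vendor’s price list:

```python
# Illustrative only: these prices and token counts are invented to show the
# mechanism, not actual Context-Bench or vendor figures.

def run_cost(input_tokens: int, output_tokens: int,
             price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Total dollar cost of one benchmark run at given per-million-token prices."""
    return (input_tokens / 1e6) * price_in_per_mtok + (output_tokens / 1e6) * price_out_per_mtok

# Model A: cheaper per token, but chattier -- it re-reads context and emits more steps.
model_a = run_cost(input_tokens=9_000_000, output_tokens=1_200_000,
                   price_in_per_mtok=1.25, price_out_per_mtok=10.0)

# Model B: pricier per token, but more token-efficient on the same tasks.
model_b = run_cost(input_tokens=4_000_000, output_tokens=600_000,
                   price_in_per_mtok=3.0, price_out_per_mtok=15.0)

print(f"Model A: ${model_a:,.2f}")  # $23.25
print(f"Model B: ${model_b:,.2f}")  # $21.00
```

The point is simply that an agent that keeps re-reading its context or takes extra steps can erase a per-token discount by the end of a long task.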
The benchmark itself is organized into task suites. The Filesystem Suite measures how well models can chain file operations, trace entity relationships, and manage multi-step information retrieval. The Skills Suite, meanwhile, evaluates how effectively they can identify, load, and apply relevant skills from a library to complete a task.
Each suite is composed of a series of controlled tasks — for example, locating and editing files within a simulated directory, or combining multiple tools to solve a long-horizon problem — with automated grading to verify whether the model reached the correct outcome and how it got there.
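Letta hasn’t published the grading harness alongside the announcement, but a grader of that kind would plausibly check both the final state and the trajectory. The sketch below is a hypothetical illustration; the task shape, tool names like read_file and write_file, and the field names are assumptions, not Letta’s API:

```python
# Hypothetical sketch of an automated grader for a file-editing task. Names such
# as `workspace`, `expected`, and `transcript` are illustrative assumptions;
# this is not Letta's actual Context-Bench harness.
from pathlib import Path

def grade_file_edit_task(workspace: Path, target: str, expected: str,
                         transcript: list[dict]) -> dict:
    """Check the outcome (file contents) and the process (tool calls used)."""
    target_path = workspace / target

    # Outcome check: did the agent leave the file in the expected state?
    outcome_ok = (target_path.exists()
                  and target_path.read_text().strip() == expected.strip())

    # Process check: did the agent actually read the file before editing it,
    # rather than guessing its contents from memory?
    tools_used = [step["tool"] for step in transcript]
    read_before_write = ("read_file" in tools_used and "write_file" in tools_used
                         and tools_used.index("read_file") < tools_used.index("write_file"))

    return {"outcome_correct": outcome_ok, "process_grounded": read_before_write}
```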
Open models, open benchmark
Another notable data point to emerge from the Context-Bench leaderboard is the gap between open and closed models. While proprietary models like Claude Sonnet 4.5 and GPT-5 top the board, open-weights entrants such as GLM 4.6 and Kimi K2 are closing in — suggesting that progress in open research is beginning to translate into stronger performance on agentic tasks.
That openness also levels the field: smaller labs and open-weights models can pit themselves against proprietary systems using the same framework. In practice, it makes progress easier to measure — and harder to obscure — by giving every researcher access to the same transparent benchmark.
This focus on measuring context comes at a time when major AI labs are racing to extend the context capacity of their models.
“Frontier AI labs like Anthropic are now explicitly training their new models to be ‘self-aware’ of their context windows to increase their context engineering capabilities,” Letta co-founder and CEO Charles Packer said. “Despite the critical importance of agentic context engineering, there's no clear open benchmark for evaluating this capability. That's why we built Context-Bench.”
In many ways, Context-Bench captures a turning point in AI research — where progress depends less on raw scale, and more on how models manage what they already know. Measuring that may prove just as important as building the next model itself.