The hidden bottleneck in agent inference

Modern LLM inference systems face an unexpected constraint: it's not computation that's slow, it's data movement. This becomes painfully obvious when you scale up an agentic system, where AI agents iterate through multiple reasoning steps, calling tools and updating context. Each turn requires massive context windows to be loaded, processed, and kept synchronized across GPUs. The compute itself finishes quickly, but the system waits for data to arrive from external storage.

This problem reveals itself gradually as you add more concurrent agents to a system. Initially, adding agents increases throughput because GPUs that were idle now have work. But somewhere around 512 to 1,024 concurrent agents, adding more stops helping. The bottleneck isn't the compute anymore; it's the network links that feed data from storage into the processing engines.
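The plateau described above falls out of a simple min-of-two-limits model: throughput grows with agent count until the storage network's delivery rate becomes the binding constraint. A toy sketch, where every rate is an illustrative assumption rather than a measured number:

```python
# Toy model of the throughput plateau: the system serves agents at the
# minimum of its compute rate and its storage-delivery rate.
# All constants below are illustrative assumptions, not measurements.

KV_BYTES_PER_AGENT = 8 * 2**30       # ~8 GiB KV-Cache per agent (assumed)
STORAGE_BYTES_PER_SEC = 400 * 2**30  # aggregate storage bandwidth (assumed)
AGENTS_PER_SEC_PER_AGENT = 0.1       # per-agent compute completion rate (assumed)

def throughput(n_agents):
    """Agents served per second, capped by the storage I/O ceiling."""
    compute_bound = n_agents * AGENTS_PER_SEC_PER_AGENT
    io_bound = STORAGE_BYTES_PER_SEC / KV_BYTES_PER_AGENT  # 50 agents/s here
    return min(compute_bound, io_bound)

for n in (64, 256, 1024, 2048):
    print(n, throughput(n))
# Throughput stops growing once n * compute rate exceeds the I/O ceiling:
# with these assumed numbers, somewhere between 256 and 1,024 agents.
```

With different constants the knee of the curve moves, but the shape is the same: a linear ramp followed by a flat line pinned at the storage bandwidth limit.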

Relative token throughput with varying request batch size, showing diminishing returns as concurrent agents increase

Throughput plateaus as batch size grows, revealing that something other than GPU compute limits performance.

The specific culprit is the KV-Cache: the stored key and value tensors that reduce recomputation during inference. For agentic workloads with 30K to 64K token contexts, this cache is enormous. Loading it from storage to GPUs dominates the total latency. When you have hundreds of agents running simultaneously, all needing their KV-Cache at roughly the same time, a single network path becomes a chokepoint.
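A back-of-envelope calculation shows why the cache is so large. The model dimensions below are illustrative assumptions, not the actual configurations of the models evaluated in the paper:

```python
# Back-of-envelope KV-Cache size for one agent's context.
# Model dimensions here are illustrative assumptions.

def kv_cache_bytes(context_tokens, num_layers, num_kv_heads,
                   head_dim, bytes_per_elem=2):
    """Bytes of stored K and V tensors for one sequence (fp16/bf16)."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return context_tokens * per_token

# e.g. a model with 64 layers, 8 KV heads (GQA), head_dim 128 (all assumed)
size_gib = kv_cache_bytes(64_000, 64, 8, 128) / 2**30
print(f"{size_gib:.1f} GiB per agent")  # ~15.6 GiB at 64K tokens
```

At roughly 15 GiB per agent under these assumptions, a few hundred concurrent agents demand terabytes of cache movement, which is exactly the kind of load that saturates a single network path.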

Why current systems leave resources on the table

Most deployed inference systems use a disaggregated architecture: some GPUs specialize in the prefill stage (processing the agent's full context to compute initial attention states), while others specialize in decoding (generating one or a few new tokens at a time). This separation makes sense for efficiency, but it creates a resource imbalance in how storage bandwidth gets used.

All KV-Cache data flows through a single path: from external storage directly to the prefill engines. The storage network interfaces on the prefill side saturate, while those on the decode side sit mostly idle. You've invested in hardware and expensive network connectivity for those decode engines, yet they barely touch the storage network.

This is the kind of waste that's hard to spot in early designs. The system still works; it's just not using what you've paid for.

The bottleneck asymmetry: storage NICs saturated on prefill engines, idle on decode engines

Left: Current architecture forces all KV-Cache through prefill storage NICs, leaving decode storage NICs underutilized. Right: DualPath distributes the load.

The dual-path insight

The core idea is deceptively simple: stop forcing all KV-Cache data through the same path. Instead, load some of it directly into decode engines. Those engines then transfer what the prefill engines need via a secondary network path.

This sounds like it might complicate things, but it actually solves multiple problems. First, it distributes storage bandwidth demand. Prefill and decode engines now share the work of reading from storage, so neither side gets saturated. Second, the inter-engine transfers use RDMA over the compute network, which is built for high throughput and doesn't interfere with the latency-sensitive communications that GPU execution requires. The storage network and compute network operate independently, so moving data via one path doesn't clog the other.
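The bandwidth arithmetic is straightforward: when decode-side storage NICs join the reads, aggregate delivery capacity roughly doubles. A minimal sketch, with NIC counts, speeds, and cache sizes all assumed, and the RDMA hop assumed not to bottleneck since it rides the separate compute network:

```python
# Rough bandwidth model for the two designs. NIC counts, per-NIC speed,
# and per-agent cache size are illustrative assumptions. The RDMA hop
# from decode to prefill is assumed not to be the bottleneck, since it
# uses the separate compute network.

PREFILL_NICS = DECODE_NICS = 8   # storage NICs per engine group (assumed)
NIC_GBPS = 50                    # GB/s per NIC (assumed)
CACHE_GB_PER_AGENT = 8           # KV-Cache size per agent (assumed)

def batch_load_seconds(n_agents, use_dual_path):
    """Time to load a batch of KV-Caches from storage."""
    nics = PREFILL_NICS + (DECODE_NICS if use_dual_path else 0)
    return n_agents * CACHE_GB_PER_AGENT / (nics * NIC_GBPS)

print(batch_load_seconds(1024, use_dual_path=False))  # prefill NICs only
print(batch_load_seconds(1024, use_dual_path=True))   # both groups share
```

Under these assumptions the dual path halves the load time for the same batch, without adding a single NIC.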

The breakthrough is reframing the problem. Instead of "our storage network is congested," the system now faces "how do we efficiently route data between engines?" That's a scheduling problem, and scheduling problems are solvable.

Orchestrating multiple data paths

Having two paths only works if the system makes smart decisions about which requests take which route. A global scheduler continuously solves a small optimization problem in real time: given current load on storage NICs at the prefill side, available capacity on the decode side, and the overhead of inter-engine transfers, which incoming requests should load their KV-Cache via which path?

The scheduler runs at request granularity, making decisions as new agent requests arrive. It considers the load across both prefill and decode engines, the network capacity between engine groups, and a safety margin to avoid congestion. If decode storage NICs have free capacity, some prefill work routes through decode first. If those NICs are also saturated, requests go directly to prefill as before.
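The per-request decision can be sketched as a small policy function. The thresholds, the load representation, and the rule ordering are all assumptions for illustration; the paper's actual scheduler may weigh these factors differently:

```python
# Sketch of the per-request routing decision described above.
# Load values are utilization fractions in [0, 1]; the thresholds and
# decision rule are assumptions, not the paper's exact policy.

def choose_path(prefill_nic_load, decode_nic_load, rdma_load,
                safety_margin=0.1):
    """Return 'direct' (storage -> prefill) or 'via_decode'
    (storage -> decode -> RDMA -> prefill) for a new request."""
    cap = 1.0 - safety_margin
    # Take the secondary path only when it relieves a saturated prefill
    # NIC and both the decode NICs and the RDMA fabric have headroom.
    if prefill_nic_load >= cap and decode_nic_load < cap and rdma_load < cap:
        return "via_decode"
    return "direct"

print(choose_path(0.95, 0.40, 0.30))  # via_decode
print(choose_path(0.95, 0.95, 0.30))  # direct: decode NICs also saturated
```

The safety margin implements the congestion buffer mentioned above: the scheduler stops offloading before either side actually saturates.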

Beyond the global routing decision, the system also manages load within individual GPU clusters. A compute-quota mechanism ensures that prefill and decode work don't starve each other on shared hardware. If a GPU has cycles available, the scheduler balances between processing new prefill requests and continuing decoding work already in flight. This prevents one operation type from monopolizing resources and leaving the other idle.
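The quota idea can be sketched as splitting a per-step token budget, with unused share spilling over to the other side. The 50/50 default and the budget units are assumptions, not the paper's actual parameters:

```python
# Minimal sketch of a compute-quota split between prefill and decode
# work sharing a GPU. The quota value and token-budget framing are
# illustrative assumptions.

def allocate_cycles(total_tokens_budget, prefill_queue, decode_queue,
                    prefill_quota=0.5):
    """Split one scheduling step's token budget so neither work type
    starves; unused quota spills over to the other side."""
    prefill_share = int(total_tokens_budget * prefill_quota)
    prefill = min(prefill_queue, prefill_share)
    decode = min(decode_queue, total_tokens_budget - prefill)
    # Let prefill reclaim any budget decode could not use.
    prefill = min(prefill_queue, total_tokens_budget - decode)
    return prefill, decode

print(allocate_cycles(1000, prefill_queue=2000, decode_queue=100))
# -> (900, 100): decode takes what it needs, prefill gets the rest
```

The guaranteed share is what prevents starvation: even a flood of prefill requests cannot squeeze in-flight decoding below its quota, and vice versa.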

Scheduling logic for routing prefill work across decode engines via inter-engine transfers

The global scheduler selects which prefill engine receives data, considering load distribution and available network capacity.

The scheduler's role is crucial. Without it, dual paths create scheduling chaos. With it, underutilized hardware becomes a tool for rebalancing the system.

Performance in production workloads

Testing DualPath on real agentic workloads reveals substantial improvements. The evaluation covers three models: DeepSeek 27B and 660B, and Qwen 32B. Each test simulates agents with contexts ranging from 30K to 64K tokens, adding new context and generating output in multiple rounds, creating realistic multi-turn request patterns.

The offline results show throughput gains across all configurations. With 1,024 concurrent agents and 64K contexts, DualPath achieves up to 1.87x improvement on the largest model. The improvement is consistent: it shows up on small models, large models, and across different batch sizes.

Offline inference throughput across three models with varying agent counts and context lengths

N/A indicates the baseline system couldn't complete the test. DualPath handles scenarios where the original architecture fails.

The results vary with how you configure the prefill-decode ratio (how many GPUs you allocate to each stage). DualPath benefits most when both engine groups have some idle capacity, giving the scheduler room to route around the bottleneck. As the ratio shifts heavily toward one side, the improvement decreases because imbalance becomes the constraint instead of bandwidth saturation.

Varying context patterns show the optimization is robust. Whether agents append 100 tokens or 1,000 tokens to their context, whether generation length is 64 or 256 tokens, DualPath maintains its advantage. This consistency matters because real workloads are heterogeneous.

DualPath maintains throughput improvements across varying context and generation patterns

Left: Different amounts of new context appended per turn. Right: Different generation lengths. The optimization works across request patterns.

For online serving, where requests arrive continuously and systems must meet latency SLOs, DualPath averages a 1.96x throughput improvement without violating latency targets. Time-to-first-token, time-to-first-scheduled-token, and time-per-output-token all stay within acceptable bounds as arrival rates increase.

Online serving throughput with latency SLOs preserved

DualPath handles higher request arrival rates while maintaining latency constraints. The shadow shows stability over the final 150 seconds of each run.

Design principles that generalize

DualPath succeeds because it recognizes a fundamental dynamic: when one resource is saturated while another sits idle, the constraint isn't the scarce resource per se; it's the assumption that all traffic must use the same path. The system didn't need new hardware; it needed permission to route around the bottleneck.

This pattern extends beyond inference. Many distributed systems have asymmetric bottlenecks that persist not because they're unsolvable but because the straightforward routing became canonical early on. A good system asks: what happens if underutilized hardware becomes an intermediate waystation?

The broader insight is that I/O bottlenecks in large-scale systems are often scheduling problems in disguise. When storage bandwidth is exhausted on one part of the network while another part stays quiet, the real constraint is how intelligently you can distribute requests across available resources. DualPath demonstrates that even in specialized, high-performance inference systems, a better scheduler can reclaim nearly 2x of wasted throughput.

For researchers building large-scale AI systems, the lesson is to measure actual bottlenecks rather than assume them. GPU utilization looks fine until you measure storage I/O patterns. Network utilization looks balanced until you segment by traffic type and source. DualPath emerged because someone measured the asymmetry and asked whether it was inevitable.


This is a Plain English Papers summary of a research paper called DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.