This is a Plain English Papers summary of a research paper called HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding.
The streaming video problem nobody's really solved
Imagine watching a live security camera feed and answering questions about what's happening, right now, while the stream keeps flowing. Your brain effortlessly maintains context across frames, but AI systems face a brutal tradeoff: they need to remember everything for understanding, but remembering everything means running out of GPU memory within seconds.
This is the core problem that video understanding systems face. Multimodal Large Language Models have become genuinely good at understanding video. Show a 30-second clip and ask what happens, and modern systems get it right. But that works for offline video, where you process the whole thing at once. The moment someone tries to use these models on streaming video, everything breaks. The video doesn't stop arriving. Questions come in continuously. You need an answer immediately, not after computing for 10 seconds. And your GPU has finite memory, not an infinite buffer.
The nightmare scenario unfolds quickly: you're buffering incoming frames into the model's memory system, and within 30 to 60 seconds of streaming, you've run out of GPU RAM. Or you start dropping old frames to save memory, and then someone asks a question about something that happened early in the stream, and the model has no idea because you deleted it. Or you compress everything aggressively, and the model loses the fine details needed for accurate answers. Every approach fails because they're solving the wrong problem: they're trying to fit infinite video into finite memory by treating all time equally.
HERMES breaks this trap by revealing something fundamental: the part of the model that stores information doesn't need to preserve every detail uniformly. Just like you remember a movie not by recalling every frame identically, but by keeping key moments vivid while letting minor details fade, a model can strategically compress old information into coarser summaries while keeping recent frames crisp. The system achieves 10x faster response times and uses 68% fewer tokens than naive approaches, while actually improving accuracy on streaming benchmarks.
How transformers remember and why it breaks for video
To understand why HERMES works, you need to see how transformers remember things in the first place. When a transformer processes a sequence, each new token gets compared against everything that came before it using attention, a mechanism that computes how relevant each past item is to the current moment. To make this fast, transformers store the computed "keys" and "values" for each past item in a cache, the KV cache, rather than recomputing them every time.
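To make the mechanism concrete, here's a minimal single-head sketch in NumPy. The toy dimension and the missing pieces (projections, multiple heads, positional encoding) are simplifications for illustration, not the paper's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 64                     # toy head dimension
k_cache, v_cache = [], []  # the KV cache: one cached (key, value) per past token

def step(query, new_key, new_value):
    """Process one new token: cache its key/value, then attend over everything seen so far."""
    k_cache.append(new_key)
    v_cache.append(new_value)
    K, V = np.stack(k_cache), np.stack(v_cache)   # (num_cached, d)
    weights = softmax(K @ query / np.sqrt(d))     # how relevant each past token is
    return weights @ V                            # weighted mix of cached values

for _ in range(5):  # each step reuses cached keys/values instead of recomputing them
    out = step(np.random.randn(d), np.random.randn(d), np.random.randn(d))
print(len(k_cache))  # 5 entries: the cache grows with every token processed
```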
For text, this is fine. A typical conversation might have hundreds or thousands of tokens. But video is fundamentally different. A single frame might be tokenized into hundreds or thousands of visual tokens, depending on resolution and how the vision encoder works. At roughly 1,000 tokens per frame, streaming video at 30 frames per second for just one minute produces 1.8 million visual tokens. Each one stays in the KV cache, and within about 30 seconds of streaming you've filled your GPU memory budget.
Worse, every single new token the model processes has to attend to all of them. The computational cost of attention scales quadratically with sequence length. Process 100 tokens and attention requires 10,000 operations. Process 1 million tokens and it's a trillion operations. This is the context length problem, and it's why streaming video breaks existing models. The entire attention mechanism assumes the context is small enough to process quickly. Streaming video violates that assumption fundamentally.
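The back-of-the-envelope arithmetic behind those numbers, assuming roughly 1,000 visual tokens per frame (the exact figure depends on the encoder):

```python
tokens_per_frame = 1_000   # assumption; actual count depends on resolution and encoder
fps, seconds = 30, 60

total_tokens = tokens_per_frame * fps * seconds
print(total_tokens)        # 1,800,000 visual tokens after one minute of streaming

# Attention compares every token against every other token: cost grows quadratically.
print(100 ** 2)            # 10,000 pairwise scores for 100 tokens
print(1_000_000 ** 2)      # 1,000,000,000,000 -- a trillion for a million tokens
```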
Related systems like StreamMem have attempted solutions using query-agnostic memory strategies, but these approaches still struggle with the fundamental mismatch between an infinite stream and finite computation.
Your memory is already hierarchical
The mechanistic insight in HERMES comes from studying how attention actually works in video understanding tasks. When you analyze which past frames the model attends to when answering a question, a clear pattern emerges: recent frames are attended to in fine detail, while older frames are attended to more coarsely. The model has already learned to treat its memory hierarchically, without being explicitly designed to do so.
The authors realized something crucial: instead of fighting this natural tendency, what if they engineered the KV cache to match this hierarchy deliberately? Keep recent video tokens at full resolution in the cache. Compress older tokens into summaries. This isn't a hack trying to force the model into unnatural behavior; it's mechanistically aligned with what the model is already trying to do.
This solves the infinite memory problem with elegance. You're not storing every detail forever. You're storing a pyramid: a small, detailed buffer of recent frames, and increasingly coarse summaries as you go back in time. The total memory usage stays bounded, even for arbitrarily long streams. The problem shifts from "how do we store infinite information" to a design question: "what's the right compression strategy for older information?"
Building the hierarchical cache system
The hierarchy works through a multi-tier KV cache strategy. Recent video frames are stored at full token granularity in the cache, functioning as working memory. These frames matter most for current questions, so full detail is preserved here. This tier is small, holding maybe 4 to 8 seconds of video.
As frames age out of this recent tier, they get compressed. Instead of storing every visual token, the system downsamples or aggregates groups of tokens together, cutting memory by perhaps 4 to 8 times while retaining decent fidelity. A frame from 30 seconds ago sits in this middle tier.
Frames from much further back get compressed even more aggressively, maybe 16 to 32 times reduction. Only the essential semantic information is stored. A frame from five minutes ago is represented compactly. The key technical detail is that this compression happens automatically during inference without retraining the model. The system achieves this by intelligently selecting which tokens to keep or aggregate using attention scores. Tokens the model is already ignoring get dropped or merged. Tokens carrying important information get preserved.
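The paper's exact selection rule isn't reproduced here; the sketch below only illustrates the general recipe under assumed parameters (`keep_ratio` and `merge_group` are illustrative knobs): keep the tokens with the highest accumulated attention scores verbatim, and average-pool the rest into summary tokens.

```python
import numpy as np

def compress_tier(keys, values, attn_scores, keep_ratio=0.25, merge_group=4):
    """Shrink one tier of the cache: keep the most-attended tokens verbatim and
    average-pool the rest in small groups. keep_ratio and merge_group are
    illustrative knobs, not values from the paper."""
    n = keys.shape[0]
    n_keep = max(1, int(n * keep_ratio))
    order = np.argsort(attn_scores)[::-1]        # most-attended tokens first
    keep_idx = np.sort(order[:n_keep])           # keep these in temporal order
    drop_idx = np.sort(order[n_keep:])

    merged_k, merged_v = [], []
    for i in range(0, len(drop_idx), merge_group):
        group = drop_idx[i:i + merge_group]
        merged_k.append(keys[group].mean(axis=0))    # one summary token per group
        merged_v.append(values[group].mean(axis=0))

    new_k = np.concatenate([keys[keep_idx], np.stack(merged_k)]) if merged_k else keys[keep_idx]
    new_v = np.concatenate([values[keep_idx], np.stack(merged_v)]) if merged_v else values[keep_idx]
    return new_k, new_v

# A tier of 256 cached tokens shrinks to 64 kept + 48 merged summaries = 112 tokens.
k, v = np.random.randn(256, 64), np.random.randn(256, 64)
new_k, new_v = compress_tier(k, v, attn_scores=np.random.rand(256))
print(k.shape[0], "->", new_k.shape[0])   # 256 -> 112, roughly a 2.3x reduction
```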
How HERMES actually works in practice
When you start streaming video into HERMES, the timeline unfolds as follows. The first few frames go into the recent tier at full resolution. The model answers questions with complete information about these frames.
After roughly 30 seconds, the recent tier fills up. The oldest frames move into the middle tier, compressed by about 8 times. Now the model has detailed memory of the last 30 seconds and coarser memory of the previous 30 seconds.

After several minutes, the middle tier starts filling up too, and its oldest entries move into the long-term tier, compressed by another 4 times. The result is a tiered hierarchy: crisp details for the last half minute, medium summaries for 30 seconds to 3 minutes ago, and coarse summaries for everything older.
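A toy bookkeeping version of that timeline, with frames reduced to bare token counts; the budgets and compression factors here are illustrative assumptions, and a real system would also cap or further merge the oldest tier:

```python
class TieredCache:
    """Toy three-tier bookkeeping: frames are represented only by their token counts."""
    def __init__(self, recent_budget=8, mid_budget=32, mid_compress=8, long_compress=4):
        self.recent, self.mid, self.long = [], [], []
        self.recent_budget, self.mid_budget = recent_budget, mid_budget
        self.mid_compress, self.long_compress = mid_compress, long_compress

    def add_frame(self, tokens=1_000):
        self.recent.append(tokens)                            # newest frame, full detail
        while len(self.recent) > self.recent_budget:          # age out of the recent tier
            self.mid.append(self.recent.pop(0) // self.mid_compress)
        while len(self.mid) > self.mid_budget:                # age out of the middle tier
            self.long.append(self.mid.pop(0) // self.long_compress)

    def total_tokens(self):
        return sum(self.recent) + sum(self.mid) + sum(self.long)

cache = TieredCache()
for _ in range(600):            # ten minutes of video at one stored frame per second
    cache.add_frame()
print(cache.total_tokens())     # ~29,000 tokens instead of 600 * 1,000 = 600,000
```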
Here's where the real-time performance breakthrough happens. When someone asks a question, HERMES has already pre-computed and cached all the key-value pairs. When the question arrives, the system doesn't need to process the entire video history. It does one forward pass through the model with the question and the pre-built KV cache. This achieves 10 times faster time-to-first-token compared to approaches that reprocess video on each query. Continuous video keeps flowing, but responses come back in 50 to 100 milliseconds instead of 500 milliseconds to 1 second. That gap is the difference between a responsive system and a sluggish one.
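A rough cost model of why the cached path is so much faster, counting only attention-score computations (real latency also involves feed-forward layers, memory bandwidth, and batching; the token counts below are illustrative, not measured):

```python
cache_tokens     = 20_000    # size of the prefilled hierarchical KV cache (illustrative)
raw_video_tokens = 200_000   # what a reprocess-everything baseline would encode at query time
question_tokens  = 30

# Baseline: re-encode the entire video history plus the question when the query arrives.
reprocess_cost = (raw_video_tokens + question_tokens) ** 2

# HERMES-style: the video side is already cached, so only the question tokens
# need to compute attention against the cache.
cached_cost = question_tokens * (cache_tokens + question_tokens)

print(f"{reprocess_cost / cached_cost:,.0f}x fewer attention scores at query time")
```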
Results that demonstrate the impact
The proof is empirical. HERMES gets tested on standard video understanding benchmarks adapted for streaming scenarios. The real-time performance shows 10 times faster time-to-first-token compared to state-of-the-art streaming baselines. When someone asks a question about a live video feed, they get an answer almost immediately.
But here's the surprising part: HERMES uses 68% fewer video tokens than a baseline that uniformly samples frames. In other words, it stores dramatically less information. Despite this massive reduction, it maintains or improves accuracy. On streaming-specific benchmarks, it achieves up to 11.4% improvements over prior methods. The model isn't just fitting within memory constraints, it's getting smarter.
Why does this happen? The hierarchical approach is more efficient than uniform compression. When you sample frames uniformly, you lose information equally from all parts of the stream. But recent frames matter more for answering questions. HERMES keeps recent frames crisp and older frames coarse, which turns out to be a better tradeoff than keeping everything uniformly mediocre. This principle extends to work on efficient multi-stage inference pipelines, where compression strategies fundamentally affect performance.
The results show that hierarchical memory is not just efficient in principle, it's actually more accurate than the baselines in practice.
Strategic forgetting beats perfect memory
There's a philosophical insight hiding in these results. We typically think of memory as something to maximize. More context, more information, better understanding. But HERMES demonstrates something counter-intuitive: strategic forgetting beats perfect memory. By deliberately compressing older information, the model actually performs better.
Raw video has a lot of noise and redundancy. A frame showing a static background is visually complex but informationally simple. Compressing it loses visual detail but preserves semantic information. The signal-to-noise ratio actually improves. When the KV cache is enormous, the model's attention mechanism gets diluted. There's so much to pay attention to that truly important information gets underweighted. A smaller, curated cache makes important information easier to find.
This also aligns with how real-time human understanding works. You don't remember a conversation in perfect detail from an hour ago; you remember the gist and the key points. HERMES works better precisely because it's doing something humans already do naturally.
Related work on efficient online video understanding has tackled similar problems, but HERMES uniquely leverages the hierarchical structure of attention itself as the organizing principle.
Why this matters beyond the paper
HERMES isn't just a faster inference system. It's a proof that the right way to think about context isn't "how much can we fit" but "what's the right structure for what we're storing." This principle extends beyond video. Any streaming understanding task, any real-time model, might benefit from asking: what granularity of memory do we actually need at each time scale?
The practical impact is immediate: real-time video understanding becomes feasible on resource-constrained hardware. But the intellectual impact is broader. The way transformers naturally attend to information should guide how you architect them. Sometimes, the key to better performance is engineering the right way to forget.
If you like these kinds of analyses, join AIModels.fyi or follow on Twitter.