Abstract

This article offers an in-depth, research-minded view of how LM Cache operates and how its caching machinery improves the efficiency, scalability, and cost of Large Language Model (LLM) deployment. We examine different caching architectures and mechanisms, how they integrate with modern AI infrastructure, and how their performance is evaluated. Case studies from the field detail how LM Caches are being deployed in practice and what practitioners have learned along the way. Finally, we conclude by highlighting challenges, limitations, and future directions in this fast-evolving field.

Introduction

Think of an AI assistant that has to provide the same answers hundreds of times a day. Does it really have to re-generate every answer from scratch, literally every time? While LLMs (think GPT-4, PaLM, and their peers) are transforming everything from customer support to code generation, their deployment at scale is stymied by latency requirements, computational cost, and memory footprint. In production, each query to a large model may take seconds and considerable GPU compute. As model sizes and user expectations grow, fast inference becomes more important than ever.

Against this backdrop, the LM cache emerges as a pragmatic answer to these problems: it stores and reuses results that have already been computed. Caching is fundamentally what lets the system remember what it has seen before. The idea is not new in computer science, but applied to LLMs it yields significantly higher throughput and a sharp drop in response time. LM Cache complements other optimization techniques such as model distillation (creating smaller, faster models), retrieval-augmented generation (bolting on external knowledge bases), and vector databases for semantic search. When caching is deployed in conjunction with these methods, large-scale LLM deployments gain not only higher throughput and lower latency but also impressive cost savings across the board, making them far more practical to adopt.

Foundations of LM Cache

The concept of an LM cache is simple: reuse computation instead of repeating it. In LLM inference, caching can exist at several levels of granularity, from whole prompt/response pairs at the application level, to the key-value (KV) attention states inside the model, to the embeddings used by retrieval components.

All of these caching levels rest on the same straightforward principle: the system should not re-compute what it has already computed. Because of the self-attention mechanism, the computation for each new token in a transformer-based LLM depends on the tokens that came before it. Whether at the prompt level or deep within the model, caching the results of those earlier calculations lets subsequent operations jump straight to the answer, yielding significant improvements in speed and resource usage.
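
To make the model-level idea concrete, here is a toy, single-head sketch of KV caching during autoregressive decoding. Everything in it (the random projection matrices, the stand-in "token embeddings", the lack of batching and multiple heads) is simplified for illustration; real implementations cache keys and values per layer and per head.

```python
# Toy illustration of KV caching in a single attention head. The projection
# matrices and "token embeddings" are random stand-ins; only the caching
# pattern itself is the point.
import numpy as np

d = 64                        # hidden size of the toy head
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []     # the "KV cache": one key/value per seen token

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """Process one new token embedding, reusing cached keys/values."""
    q = x_new @ Wq
    # Only the NEW token's key and value are computed; earlier tokens' keys
    # and values are read from the cache instead of being recomputed.
    k_cache.append(x_new @ Wk)
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)   # (seq_len, d)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over past tokens
    return weights @ V                            # attention output for the new token

for _ in range(5):
    decode_step(rng.standard_normal(d))           # decode five "tokens"
print(f"KV cache now holds state for {len(k_cache)} tokens")
```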

LM Cache Architectures and Strategies

Since not all LLM workloads are the same, engineers have designed a variety of architectures and strategies to maximize cache hits and reduce overhead. Key approaches include static versus dynamic cache policies, distributed cache architectures, and semantic matching that turns near-duplicate queries into cache hits.

In short, building an LM cache is a delicate balance: we want to maximize the hit rate while minimizing overheads such as memory consumption and staleness. The strategies above (static vs. dynamic policies, distributed architectures, semantic matching of similar queries) are all ways to get more out of caching without tripping over cache-related pitfalls. Different applications may blend these methods. For example, a production system could use a semantic prompt cache at the application level (for cases where user queries resemble one another), a KV cache at the model level (to speed up token generation), and an embedding cache for the retriever component. Another advantage is that caching scales and can be layered at every level of the stack.
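
As a concrete illustration of the application-level strategy just mentioned, here is a minimal sketch of a semantic prompt cache that reuses a stored response when a new query's embedding is close enough to a cached one. The `embed` function and the 0.9 similarity threshold are placeholders, not part of any specific product; a real system would plug in a sentence-embedding model and an approximate-nearest-neighbor index.

```python
# Minimal semantic prompt cache: reuse a cached response when a new query's
# embedding is sufficiently similar to a previously answered one.
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; swap in a real sentence-embedding model."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(384)
    return v / np.linalg.norm(v)

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []   # (embedding, response)

    def get(self, query: str):
        q = embed(query)
        for emb, response in self.entries:
            if float(q @ emb) >= self.threshold:          # cosine similarity (unit vectors)
                return response                           # hit: skip the LLM call
        return None                                       # miss

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

cache = SemanticCache()
query = "How do I reset my password?"
if (answer := cache.get(query)) is None:
    answer = "<expensive LLM call goes here>"
    cache.put(query, answer)
```

The threshold embodies the tradeoff discussed later: set it too low and semantically different queries receive the wrong cached answer; set it too high and the cache rarely hits.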

Architecture Diagram of LM Cache in an LLM Inference Pipeline

Below is the reference architecture depicting LM Cache’s integration within a typical LLM inference flow:

Workflow Diagram Comparing Cached vs. Non-Cached Inference Requests

To further understand the benefits, consider the workflow comparison between cached and non-cached inference requests:

Integration with Modern AI Infrastructure

LM caching doesn't stand alone; it lives within the wider AI application ecosystem. Here we look at how caching fits into a variety of infrastructures and frameworks, from inference libraries and serving runtimes to the hosted offerings of major providers.

To summarize, modern AI stacks are increasingly "cache-aware." Higher-level systems are starting to internalize this, from libraries that handle the mechanics for you to serving architectures that share caches across requests and infrastructure. Caching is something most LLM services rely on, and the big players in AI (OpenAI, Google, and Microsoft, for example) have all discussed or shipped caching in their respective LLM offerings. Some specifics are well known: OpenAI's ChatGPT presumably uses large-scale conversation-context caching (this is not publicly documented); Anthropic's Claude team shipped a user-visible feature, cached prompts, to save cost (covered in the case studies below); Azure AI services cache through ONNX Runtime and DeepSpeed; and Google's PaLM model adopted multi-query attention, which makes deep-context caching cheaper (Chowdhery et al., 2022). Caching is one of the secret ingredients that lets these massive models be served at scale to millions of users.
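
At the library level, KV caching is often a single flag away. The sketch below uses Hugging Face Transformers' built-in generation-time cache (`use_cache=True`); the small GPT-2 checkpoint is chosen purely so the example runs anywhere, not because it needs caching as badly as a frontier model does.

```python
# Model-level KV caching with Hugging Face Transformers. use_cache=True keeps
# the attention keys/values of already-processed tokens so each new token is
# generated without re-encoding the whole prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # tiny model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Caching lets the model skip", return_tensors="pt")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20, use_cache=True)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```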

Performance Metrics and Evaluation

Caching might sound amazing in theory, but how do we measure it? A few common performance metrics help evaluate the effectiveness of LM caching: cache hit rate, latency with and without the cache, memory overhead, and cost per query.

When evaluating LM caching, it is important to weigh the tradeoffs: you save compute at the cost of more memory, and you get lower latency but must handle the occasional stale answer. It is therefore crucial to look at metrics like those above in conjunction. For example, a 5% risk of an answer being slightly out of date might be worth getting results 10× faster, or it might not (medical advice and casual chit-chat have very different stakes).
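
The arithmetic behind these tradeoffs is simple enough to keep in a notebook. The sketch below computes hit rate and per-path latency from a hypothetical request log; a real deployment would pull the same numbers from its serving metrics.

```python
# Basic cache metrics from a hypothetical request log: hit rate and average
# latency on the cached vs. uncached path.
from statistics import mean

request_log = [
    # (was_cache_hit, latency_in_seconds)
    (True, 0.12), (False, 2.40), (True, 0.10), (True, 0.15), (False, 2.10),
]

hit_latencies = [lat for hit, lat in request_log if hit]
miss_latencies = [lat for hit, lat in request_log if not hit]

hit_rate = len(hit_latencies) / len(request_log)
print(f"hit rate:           {hit_rate:.0%}")
print(f"avg latency (hit):  {mean(hit_latencies):.2f}s")
print(f"avg latency (miss): {mean(miss_latencies):.2f}s")
# Blended latency is what users actually experience.
print(f"avg latency (all):  {mean(lat for _, lat in request_log):.2f}s")
```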

Both academic papers and industry reports are starting to incorporate caching in their benchmarks, which makes sense given how widely caches are deployed. A 2024 survey observed that almost all top submissions on long-context LLM tasks relied on some form of caching or attention optimization to deal with the context length effectively (Chen et al., 2024). The reported numbers are consistently large wins. Simply put, caching reduces latency, increases throughput, and decreases cost per query, both by serving repeated queries faster and by avoiding duplicate work (Paul, 2025). Without these optimizations, any assessment of LLM serving would be far from complete.

Real-World Case Studies

Caching is not just an academic argument; it delivers tangible value in practice. Here are a few examples from different scenarios that show how caching has helped:

In August 2024, Anthropic added Prompt Caching to Claude 3, the first time a commercial LLM exposed caching directly to users. Anthropic's servers can store huge, stable contexts, such as an entire book (100,000 tokens). Writing to the cache costs 25% more per input token, but reading cached content costs only 10% of the base rate (Anthropic, 2024). The first call populates the cache; subsequent calls reference the cached prompt and send only the new content. Customers reported responses up to four times faster (latency dropping from 11.5 seconds to 2.4 seconds), with costs falling by 90% for 100,000-token prompts and by about 86% for 10,000-token prompts (Anthropic, 2024). Conversational agents, coding assistants, and large-document Q&A are common uses, and a stable context yields the best cache hit rates. By combining performance improvements with financial incentives, Anthropic turned caching into a strong business case. It is no surprise that competitors such as OpenAI and Google are adding similar features to capture the same cost and latency benefits.
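
For reference, here is roughly what using this feature looks like from the client side, based on Anthropic's published prompt-caching interface: a content block in the request is marked with `cache_control`, and subsequent requests sharing the same prefix are served from the cache. Treat the details (model name, file name, and whether a beta header is still required by your SDK version) as illustrative rather than authoritative.

```python
# Sketch of Anthropic prompt caching via the `anthropic` Python SDK. The large
# document and model name are placeholders; consult Anthropic's docs for the
# current parameters (early versions required an "anthropic-beta" header).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

big_context = open("reference_manual.txt").read()  # hypothetical large, stable document

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": big_context,
            # Marks this block as cacheable: the first call writes it to the
            # cache (at a premium), later calls with the same prefix read it
            # back at a fraction of the normal input-token price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3 of the manual."}],
)
print(response.content[0].text)
```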

Another case: caching of OpenAI's system messages and few-shot examples. OpenAI has not described this publicly, but one can imagine that if developers include the same instructions and/or few-shot texts in every request, that portion could be cached server-side so the model does not have to re-encode those examples from scratch each time, analogous to Anthropic's approach. For example, if you send the same 2,000-token system prompt and instructions on every API call (as many applications do), OpenAI could cache the model's internal representation of that prefix and simply combine it with the processed representation of your query. This would be completely transparent to users, could save an enormous amount of compute, and would help OpenAI handle more load on the same hardware.
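
To make the speculation concrete, here is a toy sketch of what such server-side prefix caching could look like: cached internal state keyed on a hash of the shared system prompt, so only the user-specific suffix is processed per request. Nothing here reflects OpenAI's actual implementation; every name is hypothetical.

```python
# Hypothetical server-side prefix cache: pay the cost of encoding a shared
# system prompt once, then reuse that state for every request that repeats it.
import hashlib

prefix_cache: dict[str, dict] = {}   # prefix hash -> cached "encoded" state

def encode_prefix(prefix: str) -> dict:
    """Stand-in for the expensive step of running the model over the prefix."""
    return {"prefix_tokens": len(prefix.split())}

def handle_request(system_prompt: str, user_query: str) -> None:
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = encode_prefix(system_prompt)   # cache miss: pay once
    state = prefix_cache[key]
    # Only the user-specific suffix is encoded from scratch.
    print(f"reused {state['prefix_tokens']}-token prefix; "
          f"encoded {len(user_query.split())} new tokens")

shared_prompt = "You are a helpful assistant. " * 100
handle_request(shared_prompt, "What's the weather today?")
handle_request(shared_prompt, "Translate 'hello' into French.")
```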

Altogether, nearly all major AI companies use LM caching in production. OpenAI, Anthropic, Microsoft, Google, and Hugging Face have all at some point discussed or documented caching strategies, and while their lessons differ, most reach the same conclusion: caching provides a huge performance boost, but cache freshness must be managed without sacrificing memory and scalability. Stories along the lines of "we added a cache and saw a 30% drop in GPU usage while keeping QPS intact" are not unheard of. For AI startups and enterprises alike, these case studies point to one unmistakable fact: caching is the ticket to turbocharging LLM applications.

Challenges and Limitations

Naturally, caching is no silver bullet; if it were easy, everyone would already be caching everything. In practice, engineers have to work around numerous challenges and limitations when deploying LM caches, including cache staleness and invalidation, memory overhead, handling of sensitive data, and added system complexity.

Nevertheless, for many workloads the overall performance gains of caching significantly outweigh these costs. It is therefore no surprise that almost every enterprise-scale LLM service involves caching in one form or another; the difference lies in whether the design work has been done to cope with the pitfalls. To mitigate the risks: cache content that is not ultra-sensitive (or that changes slowly), use robust cache-invalidation strategies, keep an eye on hit rate and memory usage, and have fallbacks so that if something goes wrong with the cache, the system still works, just more slowly. We are starting to see better tools and libraries that handle much of this automatically (for example, framework-level encrypted caches or auto-invalidation hooks on model updates).
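
Here is a minimal sketch of those mitigations, with every class and parameter name invented for illustration: entries expire after a TTL, a hook clears the cache when the model is updated, and any cache failure simply falls back to normal (slower) inference.

```python
# Toy cache wrapper combining the mitigations above: TTL-based expiry,
# explicit invalidation on model updates, and fail-open lookups.
import time

class SafeLLMCache:
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}   # key -> (stored_at, response)

    def get(self, key: str):
        try:
            entry = self._store.get(key)
            if entry is None:
                return None
            stored_at, response = entry
            if time.time() - stored_at > self.ttl:        # stale: treat as a miss
                del self._store[key]
                return None
            return response
        except Exception:
            return None      # fail open: a broken cache only costs speed, not correctness

    def put(self, key: str, response: str) -> None:
        self._store[key] = (time.time(), response)

    def invalidate_all(self) -> None:
        """Wire this to a model-update hook so stale answers never outlive the model."""
        self._store.clear()

cache = SafeLLMCache(ttl_seconds=60.0)
if cache.get("greeting-prompt") is None:
    cache.put("greeting-prompt", "Hello! How can I help you today?")
```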

LLM caching is extremely powerful, but as we have seen it is not free, and it adds another layer of complexity to an LLM system: the cache itself. One of the better-known sayings in computer science goes: there are only two hard things in Computer Science, cache invalidation and naming things. LLM practitioners are now intimately familiar with the first! But with good design the trade-offs can be managed and the immense benefits realized.

Future Directions and Research Opportunities

The LLM cache field is highly dynamic, and researchers and engineers are investigating numerous avenues to further expand its capabilities, from learning-driven cache policies to cache compression and cross-modal reuse.

The future of LM caching, then, looks bright and full of novel solutions. The impetus is obvious: as we roll out bigger models to larger numbers of users, caching is one of the easiest places to claw back efficiency. The next iterations will make caching smarter (learning-driven policies), lighter (compression and better memory management), and more pervasive, spanning different modalities and both training and inference. It is an arms race not just to build bigger models but to serve them smarter. These caching and efficiency improvements are why LLMs like GPT-4 and Llama-2 can respond so quickly in chat interfaces; as one Hugging Face engineer quipped, "the reason GPT-4 and Llama-2 are able to run quick at all in chat interfaces is due to these caching and efficiency improvements" (Hugging Face Transformers Team, 2023). From here on, the providers with the most aggressive caching strategies may well offer the fastest and cheapest AI services. After all, why compute when you can cache?

References:

Anthropic. (2024, August 14). Prompt caching with Claude. Anthropic News (anthropic.com).

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. NIPS Deep Learning and Representation Learning Workshop.

Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP. Advances in Neural Information Processing Systems.

Paul, R. (2025, April 20). Caching Strategies in LLM Services for both training and inference. Rohan’s Bytes blog (rohan-paul.com).

Pol, S. (2025, April 6). Accelerating Transformer Inference with KV Caching. Medium blog (sudhirpol522.medium.com).

Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150.

Shutova, A., Malinovskii, V., Egiazarian, V., et al. (2025). Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models. arXiv:2501.19392.

Yazdani Aminabadi, R., Rajbhandari, S., Zhang, M., et al. (2022). DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032.

Kwon, W., Li, Z., Zhuang, S., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). Proceedings of SOSP 2023 (blog.vllm.ai).

Hugging Face Transformers Team. (2023). Optimizing your LLM in production. Hugging Face Blog (huggingface.co).

Lorello, S., McCormick, R., & Partee, S. (2023). How to Build a Distributed Inference Cache with NVIDIA Triton and Redis. NVIDIA Technical Blog.