Table of Links
2 Background
2.2 Fragmentation and PagedAttention
3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel
3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead
4 Insights into LLM Serving Systems
5 vAttention: System Design and 5.1 Design Overview
5.2 Leveraging Low-level CUDA Support
5.3 Serving LLMs with vAttention
6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation
6.2 Hiding memory allocation latency
7.1 Portability and Performance for Prefills
7.2 Portability and Performance for Decodes
7.3 Efficacy of Physical Memory Allocation
7.4 Analysis of Memory Fragmentation
6.2 Hiding memory allocation latency
The serving framework invokes the step API in every iteration. The latency of step depends on how many new pages need to be mapped into the virtual tensors of the KV-cache. Consider, for example, that the KV-cache of one request needs to be extended for Yi-34B, which has 60 layers. This requires 120 calls to vMemMap (one each for the K and V tensors of every layer), each of which takes about 9 microseconds. Therefore, growing the KV-cache of one request by a single page adds about 1 millisecond of latency to the corresponding iteration, and the overhead grows in proportion to the amount of physical memory that needs to be mapped. We propose the following optimizations to hide the latency of allocation:
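As a back-of-the-envelope check of these numbers, the short sketch below (in Python) computes the latency added to an iteration when pages are mapped synchronously. The constants come from the Yi-34B example above; the assumption of two vMemMap calls per layer per page (one for the K tensor, one for the V tensor) is what yields the 120 calls mentioned in the text.

```python
# Illustrative arithmetic only; constants are taken from the Yi-34B example above.
NUM_LAYERS = 60            # Yi-34B
TENSORS_PER_LAYER = 2      # one K-cache and one V-cache tensor per layer (assumed)
VMEMMAP_LATENCY_US = 9     # reported latency of a single vMemMap call

def sync_mapping_latency_us(new_pages: int) -> int:
    """Latency (in microseconds) added to an iteration if `new_pages` pages of
    one request are mapped synchronously inside the step API."""
    calls = new_pages * NUM_LAYERS * TENSORS_PER_LAYER
    return calls * VMEMMAP_LATENCY_US

# Growing one request's KV-cache by a single page: 120 calls, ~1.08 ms.
print(sync_mapping_latency_us(1))   # 1080 (microseconds)
```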
6.2.1 Overlapping memory allocation with compute. We leverage the predictability of memory demand to overlap memory allocation with computation. In particular, note that each iteration produces a single output token for every decode request. Therefore, the memory demand of a decode iteration is known ahead of time. Further, in the decode phase, a request requires at most one new page per iteration. vAttention keeps track of the current context length and the number of physical memory pages already mapped for each request. Using this information, it determines when a request will need a new page and uses a background thread to allocate that page while the preceding iteration is executing. For example, consider that a request R1 will require a new page in iteration i. When the serving framework invokes the step API in iteration i-1, vAttention launches a background thread that maps the physical memory pages required for iteration i. Since iteration latency is typically in the range of tens to hundreds of milliseconds, the background thread has enough time to prepare the physical memory mappings for an iteration before it starts executing. This way, vAttention hides the latency of the CUDA APIs by mapping physical pages into the KV-cache tensors out of the critical path. Note that in every iteration, the step API still needs to ensure that the physical pages required for the current iteration are actually mapped; if not, the required pages are mapped synchronously.
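A minimal sketch of this overlap is shown below, assuming a hypothetical `map_page()` wrapper around the per-layer vMemMap calls and simple per-request bookkeeping. The page size, class name, and method names are illustrative stand-ins, not vAttention's actual interface.

```python
import threading

PAGE_TOKENS = 256   # tokens covered by one KV-cache page (assumed value)

def map_page(req_id):
    """Stand-in for the per-layer vMemMap calls that back one more page of req_id."""
    pass

class DecodeAllocator:
    def __init__(self):
        self.context_len = {}    # reqId -> current context length (tokens)
        self.mapped_pages = {}   # reqId -> physical pages already mapped
        self._worker = None

    def add_request(self, req_id, prompt_len):
        self.context_len[req_id] = prompt_len
        self._ensure_mapped(req_id, prompt_len)

    def _ensure_mapped(self, req_id, target_len):
        wanted = -(-target_len // PAGE_TOKENS)                    # ceil division
        for _ in range(max(0, wanted - self.mapped_pages.get(req_id, 0))):
            map_page(req_id)
            self.mapped_pages[req_id] = self.mapped_pages.get(req_id, 0) + 1

    def _map_ahead(self, batch):
        # Runs while the current iteration executes on the GPU: each decode
        # request produces exactly one token next iteration, so its memory
        # demand is known ahead of time.
        for req_id in batch:
            self._ensure_mapped(req_id, self.context_len[req_id] + 1)

    def step(self, batch):
        if self._worker is not None:
            self._worker.join()              # look-ahead mapping from last step
        for req_id in batch:
            # Synchronous fallback: pages for *this* iteration must be mapped.
            self._ensure_mapped(req_id, self.context_len[req_id] + 1)
            self.context_len[req_id] += 1    # one new token per decode request
        # Map pages needed by the *next* iteration off the critical path.
        self._worker = threading.Thread(target=self._map_ahead, args=(batch,))
        self._worker.start()
```

Joining the background thread at the start of the next step keeps the bookkeeping race-free while still hiding the mapping latency behind the previous iteration's compute.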
6.2.2 Deferred reclamation + eager allocation. We observe that allocating physical memory for a new request can be avoided in many cases. Consider that a request R1 completes in iteration i and a new request R2 joins the running batch in iteration i+1. To avoid allocating new pages for R2 from scratch, vAttention simply defers the reclamation of R1's pages and assigns R1's reqId to R2. This way, R2 uses the same KV-cache tensors that R1 was using, which are already backed by physical pages. Therefore, new pages are required for R2 only if its context length grows beyond that of R1.
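The reqId-recycling idea can be sketched as follows. The free list, page size, and `map_page` stub are assumptions made for illustration; they are not the paper's actual data structures.

```python
PAGE_TOKENS = 256    # tokens covered by one KV-cache page (assumed)
free_req_ids = []    # reqIds of completed requests whose pages are still mapped
pages_mapped = {}    # reqId -> number of physical pages mapped
_next_req_id = 0

def map_page(req_id):
    """Stand-in for the per-layer vMemMap calls backing one more page."""
    pages_mapped[req_id] = pages_mapped.get(req_id, 0) + 1

def on_request_complete(req_id):
    # Deferred reclamation: keep the pages mapped and park the reqId.
    free_req_ids.append(req_id)

def on_request_arrive(prompt_len):
    global _next_req_id
    if free_req_ids:
        req_id = free_req_ids.pop()        # inherit an already-backed KV-cache
    else:
        _next_req_id += 1
        req_id = _next_req_id
    pages_needed = -(-prompt_len // PAGE_TOKENS)   # ceil division
    # New pages are required only if this prompt does not fit in the pages
    # the previous occupant of this reqId already had mapped.
    for _ in range(max(0, pages_needed - pages_mapped.get(req_id, 0))):
        map_page(req_id)
    return req_id
```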
We further optimize memory allocation by proactively mapping physical pages before they are needed. We do so using the KV-cache of one of the inactive reqIds: when a new request arrives, we can assign it this reqId without mapping any physical pages. We then select another reqId to be allocated next and map physical pages for it in advance. In most cases, these eager optimizations obviate the need to allocate new physical pages even for the prefill phase of new requests. Finally, we trigger memory reclamation only when the number of physical memory pages cached in vAttention falls below a certain threshold (e.g., less than 10% of GPU memory). We delegate both deferred reclamation and eager allocation to the background thread spawned by the step API.
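The background thread's side of these two optimizations might look roughly like the sketch below. The 10% threshold is taken from the example above; the `premap_pages`/`unmap_pages` stubs and the pool accounting are illustrative stand-ins for the CUDA-level calls, and the trigger is interpreted here as the pool of available cached pages dropping below the threshold.

```python
RECLAIM_THRESHOLD = 0.10     # from the ~10%-of-GPU-memory example above

def premap_pages(req_id, num_pages):
    """Stand-in for eagerly backing the next-to-be-allocated reqId's tensors."""
    pass

def unmap_pages(req_id):
    """Stand-in for vMemUnmap-style calls that release a reqId's pages."""
    pass

def background_work(next_req_id, inactive_req_ids, free_pages, total_pages):
    """Work delegated to the background thread spawned by the step API."""
    # Eager allocation: back the reqId that the next arriving request will
    # receive, so even its prefill needs no synchronous page mapping.
    premap_pages(next_req_id, num_pages=1)
    # Deferred reclamation: only unmap inactive reqIds' pages once the pool
    # of available physical pages runs low.
    if free_pages / total_pages < RECLAIM_THRESHOLD:
        for req_id in inactive_req_ids:
            unmap_pages(req_id)
```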
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Ramya Prabhu, Microsoft Research India;
(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);
(3) Jayashree Mohan, Microsoft Research India;
(4) Ramachandran Ramjee, Microsoft Research India;
(5) Ashish Panwar, Microsoft Research India.