Table of Links
2 Background
2.2 Fragmentation and PagedAttention
3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel
3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead
4 Insights into LLM Serving Systems
5 vAttention: System Design and 5.1 Design Overview
5.2 Leveraging Low-level CUDA Support
5.3 Serving LLMs with vAttention
6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation
6.2 Hiding memory allocation latency
7.1 Portability and Performance for Prefills
7.2 Portability and Performance for Decodes
7.3 Efficacy of Physical Memory Allocation
7.4 Analysis of Memory Fragmentation
5.3 Serving LLMs with vAttention
We build vAttention as a Python library that internally uses a CUDA/C++ extension for interacting with CUDA drivers. Our library exposes a set of simple APIs listed in Table 4 to the serving framework.
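As a rough illustration of that interface, the Python stub below sketches what the four calls used in the rest of this section could look like. Only the function names (init, alloc_reqid, step, free_reqid) come from the text; every parameter, type, and return value here is an assumption, not the library's actual signature.

```python
# Hypothetical sketch of the vAttention interface described in this section.
# Only the function names come from the text; parameters, types, and return
# values are illustrative assumptions, not the library's actual signatures.
from typing import Dict

def init(num_layers: int, num_kv_heads: int, head_dim: int, elem_bytes: int,
         max_batch_size: int, page_size: int) -> None:
    """Configure per-worker model parameters (N', H, D, P, B) and a preferred
    page size; reserve 2 x N' virtual KV-cache tensors and pre-allocate
    physical memory pages."""
    ...

def alloc_reqid() -> int:
    """Return a fresh reqId that tags all later memory operations of a request."""
    ...

def step(context_lengths: Dict[int, int]) -> bool:
    """Ensure every active reqId's KV-cache sub-tensors are physically backed.

    context_lengths maps reqId -> current context length (0 for inactive
    reqIds). Returns False when the memory demand cannot be satisfied.
    """
    ...

def free_reqid(req_id: int) -> None:
    """Mark a request as completed; its pages may be unmapped now or later."""
    ...
```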
5.3.1 Initial setup: When the serving framework starts, each model worker loads the vAttention library and configures it with the model parameters N′, H, D, P, B and a preferred page size via the init API. Internally, vAttention reserves 2 × N′ virtual tensors (using our modified PyTorch caching allocator) for the KV-cache at each worker. These virtual tensors are reserved for the lifetime of the serving application. In addition, vAttention also pre-allocates physical memory pages during initialization. However, these pages are not mapped into the KV-cache yet.
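To give a sense of the scale of this reservation, the sketch below estimates the virtual address space reserved per worker, assuming each of the 2 × N′ per-layer virtual tensors is sized for a maximum batch size and context length; the concrete numbers and the max_context_len input are illustrative assumptions, not values from the paper.

```python
# Rough estimate of per-worker virtual address space reserved at init time.
# Assumes each of the 2 x N' per-layer virtual tensors is sized for the
# maximum batch size and context length; all numbers below are illustrative.
n_layers = 32            # N': layers handled by this worker
kv_heads = 32            # H
head_dim = 128           # D
elem_bytes = 2           # P: bytes per element (FP16/BF16)
max_batch = 128          # B
max_context_len = 4096   # assumed maximum context length of the model

per_layer_bytes = max_batch * max_context_len * kv_heads * head_dim * elem_bytes
total_virtual_bytes = 2 * n_layers * per_layer_bytes   # K- and V-cache tensors
print(total_virtual_bytes / 2**30, "GiB of virtual address space")  # ~256 GiB
```

Only virtual address space of this size is reserved up front; physical pages are mapped into it on demand as requests grow.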
5.3.2 Scheduling a new request: When a new request is scheduled for the first time, the serving framework obtains a new reqId from vAttention via alloc_reqid. All subsequent memory management operations of the request are tagged with this reqId.
5.3.3 Model execution: Before scheduling a batch for execution, the framework needs to ensure that the KV-cache sub-tensors of each active request are backed by physical memory. For this purpose, before dispatching the first kernel of an iteration to the GPU, the framework invokes the step API, specifying the current context length of each request (the context length is set to 0 for each inactive reqId). Internally, vAttention ensures that enough physical pages are mapped for each active reqId before returning execution back to the framework. If vAttention cannot satisfy the memory demand, it returns a failure, in response to which the serving framework can preempt one or more requests to allow forward progress (this is similar to vLLM's default behavior). We leave more sophisticated policies, such as swapping out the KV-cache to CPU memory, as future work.
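The following is a minimal sketch of how a serving framework might drive these calls each iteration, assuming the stub interface sketched above; the scheduler object and its methods (newly_scheduled, all_reqids, active, preempt_one, execute_model) are hypothetical placeholders, and the preemption-on-failure loop only mirrors the vLLM-like behavior described above.

```python
# Hypothetical per-iteration driver, assuming the stub interface sketched
# earlier. The scheduler object and its methods are illustrative placeholders.
def run_iteration(scheduler):
    # Newly scheduled requests obtain a reqId before their first iteration.
    for req in scheduler.newly_scheduled():
        req.req_id = alloc_reqid()

    # Report the current context length of every reqId (0 for inactive ones),
    # so vAttention can map enough physical pages before kernels are launched.
    context_lengths = {rid: 0 for rid in scheduler.all_reqids()}
    for req in scheduler.active():
        context_lengths[req.req_id] = req.context_length

    # If vAttention cannot satisfy the demand, preempt requests until it can
    # (similar to vLLM's default behavior described in the text).
    while not step(context_lengths):
        victim = scheduler.preempt_one()
        context_lengths[victim.req_id] = 0

    scheduler.execute_model()  # dispatch the iteration's GPU kernels
```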
Depending on whether a request is in the prefill or decode phase, a different number of physical memory pages may need to be mapped for a given iteration. The prefill phase processes the input tokens of the given prompt in parallel and populates one slot in the K-cache (and V-cache) of the request at each layer of the model. Therefore, the number of pages that need to be mapped depends on the number of prompt tokens being scheduled. If the total K-cache size of all prompt tokens at one layer of the model is s and the page size is t, then each worker needs to ensure that at least (s + t - 1)/t, i.e., ⌈s/t⌉, physical memory pages are mapped in each of the 2 × N′ KV-cache sub-tensors of the given reqId.
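As a concrete illustration of the (s + t - 1)/t calculation, the snippet below plugs in assumed values: a 1500-token prompt, 8 KV heads, head dimension 128, FP16 elements, and 2MB pages. None of these numbers comes from the paper.

```python
# Worked example of the prefill page count; all numbers here are assumed.
prompt_tokens = 1500
kv_heads, head_dim, elem_bytes = 8, 128, 2      # H, D, P (FP16)
page_size = 2 * 1024 * 1024                     # t: 2MB pages

s = prompt_tokens * kv_heads * head_dim * elem_bytes  # K-cache bytes, one layer
t = page_size
pages = (s + t - 1) // t                        # ceil(s / t)
print(s, pages)  # 3072000 bytes -> 2 pages per KV-cache sub-tensor
```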
For a request in the decode phase, the number of new pages required is at most one per request. This is because each iteration produces only one output token for a request. vAttention internally tracks the number of pages mapped for each request and maps a new page only when the last page allocated to that request is fully utilized.
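The decision of when a decoding request needs a new page can be captured with a few lines of bookkeeping. The sketch below is a minimal illustration under assumed names (tokens_per_page, pages_mapped), not the library's actual logic.

```python
# Minimal sketch of the decode-time rule: map at most one new page per request,
# and only when the last mapped page is already full. Names are hypothetical.
def new_pages_for_decode(context_length: int, pages_mapped: int,
                         tokens_per_page: int) -> int:
    capacity = pages_mapped * tokens_per_page  # tokens the mapped pages can hold
    # The next iteration appends exactly one token to this request's KV-cache.
    return 1 if context_length + 1 > capacity else 0
```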
5.3.4 Request completion: A request terminates when a user-specified context length or the maximum context length supported by the model is reached, or when the model produces a special end-of-sequence token. The framework notifies vAttention of a request's completion with free_reqid. Internally, vAttention may unmap the pages of a completed request immediately or defer freeing them until later.
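A hypothetical sketch of how a framework might wire up this completion path is shown below; only free_reqid comes from the text, while the request fields and termination checks are illustrative assumptions.

```python
# Hypothetical completion hook in the serving framework; the request fields
# and thresholds are placeholders, only free_reqid comes from the text.
def maybe_release(request, eos_token_id: int, model_max_len: int) -> bool:
    finished = (request.last_token == eos_token_id
                or request.context_length >= request.user_max_len
                or request.context_length >= model_max_len)
    if finished:
        # vAttention may unmap the request's pages now, or defer freeing them.
        free_reqid(request.req_id)
    return finished
```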
This paper is available on arXiv under the CC BY 4.0 DEED license.
Authors:
(1) Ramya Prabhu, Microsoft Research India;
(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);
(3) Jayashree Mohan, Microsoft Research India;
(4) Ramachandran Ramjee, Microsoft Research India;
(5) Ashish Panwar, Microsoft Research India.