Table of Links
2 Background
2.2 Fragmentation and PagedAttention
3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel
3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead
4 Insights into LLM Serving Systems
5 vAttention: System Design and 5.1 Design Overview
5.2 Leveraging Low-level CUDA Support
5.3 Serving LLMs with vAttention
6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation
6.2 Hiding memory allocation latency
7.1 Portability and Performance for Prefills
7.2 Portability and Performance for Decodes
7.3 Efficacy of Physical Memory Allocation
7.4 Analysis of Memory Fragmentation
5 vAttention: System Design
Our goal is to improve efficiency and portability by adding dynamic memory allocation support to existing kernels. To achieve this goal, vAttention leverages system support for
dynamic memory allocation instead of implementing paging in user space.
5.1 Design Overview
vAttention builds on the ability to allocate virtual memory and physical memory separately. Specifically, we allocate a large contiguous buffer for the KV-cache in virtual memory ahead of time (similar to reservation-based allocators) while deferring the allocation of physical memory to runtime, i.e., allocating physical memory only when required (similar to PagedAttention). This way, vAttention preserves the virtual contiguity of the KV-cache without wasting physical memory. This approach is feasible because memory capacity and fragmentation are limiting factors only for physical memory, whereas virtual memory is abundant, e.g., modern 64-bit systems provide a 128TB user-managed virtual address space to each process [3].
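The decoupling of virtual and physical allocation comes from low-level virtual memory APIs (§5.2 discusses the CUDA support that vAttention leverages). As a rough illustration only, the sketch below uses the CUDA driver's virtual memory management calls to reserve a large virtually contiguous range up front and back just its first page with physical memory later; the buffer size, device choice, and error-handling macro are illustrative assumptions, not the paper's implementation.

```cpp
// Sketch (not the paper's code): reserve virtual address space for a KV-cache
// buffer up front, and map physical memory into it only when needed.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                              \
  do {                                                                           \
    CUresult rc_ = (call);                                                       \
    if (rc_ != CUDA_SUCCESS) { fprintf(stderr, "CUDA error %d\n", rc_); exit(1); } \
  } while (0)

int main() {
  CHECK(cuInit(0));
  CUdevice dev;
  CUcontext ctx;
  CHECK(cuDeviceGet(&dev, 0));
  CHECK(cuCtxCreate(&ctx, 0, dev));

  // Physical allocation properties: pinned device memory on GPU 0.
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = dev;

  size_t page = 0;  // allocation granularity (typically 2MB)
  CHECK(cuMemGetAllocationGranularity(&page, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));

  // 1) Reserve a large virtually contiguous buffer (no physical memory yet).
  size_t vbuf_size = 1ull << 34;  // 16GB, illustrative
  CUdeviceptr vbuf;
  CHECK(cuMemAddressReserve(&vbuf, vbuf_size, 0, 0, 0));

  // 2) Later, back only the first page with physical memory, on demand.
  CUmemGenericAllocationHandle handle;
  CHECK(cuMemCreate(&handle, page, &prop, 0));
  CHECK(cuMemMap(vbuf, page, 0, handle, 0));

  CUmemAccessDesc access = {};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  CHECK(cuMemSetAccess(vbuf, page, &access, 1));

  // vbuf[0 .. page) is now usable; the rest of the range costs no physical memory.
  printf("reserved %zu bytes, mapped %zu bytes\n", vbuf_size, page);
  return 0;
}
```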
5.1.1 Pre-reserving virtual memory. Since virtual memory is abundant, we pre-reserve a virtual memory space large enough to hold the KV-cache of the maximum batch size (configurable) that needs to be supported.
Number of virtual memory buffers: Each layer in an LLM maintains its own K and V tensors; we refer to them individually as the K-cache and V-cache. We allocate separate virtual memory buffers for the K-cache and V-cache. For a single-GPU job, this requires pre-reserving 2 × N buffers, where N is the number of layers in the model. In a multi-GPU job, each worker reserves 2 × N′ buffers, where N′ is the number of layers managed by that worker (N′ = N with tensor-parallelism, whereas N′ < N with pipeline-parallelism).
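A minimal sketch of this pre-reservation step, assuming hypothetical helpers reserve_buffer and reserve_kv_cache and a max_buffer_size computed as described next; only virtual address space is consumed here, no physical memory.

```cpp
// Sketch (hypothetical structure): each worker pre-reserves 2 x N' virtual
// buffers, one K-cache and one V-cache buffer per layer it manages.
#include <cuda.h>
#include <vector>

struct KVBuffers {
  std::vector<CUdeviceptr> k_cache;  // one virtual buffer per layer
  std::vector<CUdeviceptr> v_cache;  // one virtual buffer per layer
};

// Reserve (but do not back) a virtual buffer of max_buffer_size bytes.
// Error handling omitted for brevity.
static CUdeviceptr reserve_buffer(size_t max_buffer_size) {
  CUdeviceptr ptr = 0;
  cuMemAddressReserve(&ptr, max_buffer_size, 0, 0, 0);
  return ptr;
}

KVBuffers reserve_kv_cache(int num_layers_on_worker, size_t max_buffer_size) {
  KVBuffers kv;
  for (int layer = 0; layer < num_layers_on_worker; ++layer) {
    kv.k_cache.push_back(reserve_buffer(max_buffer_size));
    kv.v_cache.push_back(reserve_buffer(max_buffer_size));
  }
  return kv;  // 2 x N' reservations in total; no physical memory consumed yet
}
```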
Size of a virtual memory buffer: The maximum size of a buffer is BS = B × L × S, where B is the maximum batch size, L is the maximum context length supported by the model, and S is the size of a single token's per-layer K-cache (or V-cache) on a worker. Further, S = H × D × P, where H is the number of KV heads on a worker, D is the dimension of each KV head, and P is the number of bytes per element based on model precision (e.g., P = 2 for FP16/BF16). Note that S is constant for a given model configuration.
Consider Yi-34B with FP16 and two-way tensor-parallelism (TP-2). In this case, N = 60, H = 4, D = 128, P = 2 (the 8 KV heads of Yi-34B are split evenly across two GPUs), and the maximum supported context length is L = 200K. For this model, the maximum per-layer K-cache (or V-cache) size of a request on a worker is L × S = 200MB (200K × 4 × 128 × 2). Assuming B = 500, the maximum size of each buffer per worker is BS = 100GB (500 × 200MB). Therefore, the total virtual memory requirement for the 60 layers of Yi-34B is 120 buffers of 100GB each (12TB total). Note that the amount of virtual address space available grows with the number of GPUs, e.g., with two TP workers, the amount of virtual address space available is 256TB. Therefore, virtual memory allocations can be satisfied easily.
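As a quick check of this arithmetic, the short snippet below recomputes the sizes. It reads 200K as 200,000 tokens, giving ~204.8MB per request per layer, ~102.4GB per buffer, and ~12.3TB per worker, which match the 200MB/100GB/12TB figures quoted above up to rounding.

```cpp
// Sketch: reproduce the Yi-34B (TP-2, FP16) sizing arithmetic from the text.
#include <cstdio>

int main() {
  // Per-worker model configuration (Yi-34B, two-way tensor parallelism).
  const long long N = 60;       // layers
  const long long H = 4;        // KV heads per worker (8 heads split over 2 GPUs)
  const long long D = 128;      // dimension of each KV head
  const long long P = 2;        // bytes per element (FP16)
  const long long L = 200000;   // max context length (200K tokens)
  const long long B = 500;      // max batch size

  const long long S     = H * D * P;   // per-token, per-layer K (or V) cache: 1024 bytes
  const long long BS    = B * L * S;   // one virtual buffer: ~102.4GB (quoted as 100GB)
  const long long total = 2 * N * BS;  // 2 buffers (K and V) per layer: ~12.3TB (quoted as 12TB)

  printf("S      = %lld bytes per token\n", S);
  printf("L x S ~= %.1f MB per request per layer\n", (L * S) / 1e6);
  printf("BS    ~= %.1f GB per buffer\n", BS / 1e9);
  printf("total ~= %.1f TB per worker\n", total / 1e12);
  return 0;
}
```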
5.1.2 On-demand physical memory allocation. vAttention allocates physical memory one page at a time, and only when a request has used all of its previously allocated physical memory pages. To show how this works, we refer to a simple example in Figure 7. The example shows how vAttention manages the K-cache (or V-cache) buffer at one layer of the model, assuming a maximum batch size of two. The rest of the K-cache and V-cache buffers are managed similarly at all layers.
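A minimal sketch of this on-demand step, again using the CUDA VMM calls from the earlier sketch: when a request's context grows beyond the physical pages currently mapped into its portion of a buffer, one more page is allocated and mapped. The per-request sub-range layout, the structure and function names, and the page-granularity growth policy here are illustrative assumptions, not the paper's exact implementation (Figure 7 and §5.3 describe the actual flow).

```cpp
// Sketch (illustrative): map one more physical page into a request's reserved
// sub-range of a K-cache (or V-cache) buffer when its context length outgrows
// the pages already mapped. Error handling omitted for brevity.
#include <cuda.h>
#include <cstddef>

struct RequestKV {
  CUdeviceptr base;        // start of this request's sub-range in the virtual buffer
  size_t mapped_bytes;     // physical memory mapped for this request so far
  size_t bytes_per_token;  // S = H * D * P
  size_t num_tokens;       // current context length of the request
};

// Allocate one physical page and map it at the end of the request's mapped region.
static void map_one_page(RequestKV& req, CUdevice dev, size_t page_size) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = dev;

  CUmemGenericAllocationHandle handle;
  cuMemCreate(&handle, page_size, &prop, 0);
  cuMemMap(req.base + req.mapped_bytes, page_size, 0, handle, 0);

  CUmemAccessDesc access = {};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(req.base + req.mapped_bytes, page_size, &access, 1);

  req.mapped_bytes += page_size;
}

// Called before an iteration: grow only if the request has used up all of its
// previously allocated physical pages.
void ensure_capacity(RequestKV& req, CUdevice dev, size_t page_size) {
  while (req.num_tokens * req.bytes_per_token > req.mapped_bytes)
    map_one_page(req, dev, page_size);
}
```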
This paper is available on arXiv under a CC BY 4.0 DEED license.
[3] 64-bit systems typically use 48 bits for virtual addresses, providing a per-process virtual address space of 256TB, which is divided equally between user space and (OS) kernel space.
Authors:
(1) Ramya Prabhu, Microsoft Research India;
(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);
(3) Jayashree Mohan, Microsoft Research India;
(4) Ramachandran Ramjee, Microsoft Research India;
(5) Ashish Panwar, Microsoft Research India.