Table of Links
2 Background
2.2 Fragmentation and PagedAttention
3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel
3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead
4 Insights into LLM Serving Systems
5 vAttention: System Design and 5.1 Design Overview
5.2 Leveraging Low-level CUDA Support
5.3 Serving LLMs with vAttention
6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation
6.2 Hiding memory allocation latency
7.1 Portability and Performance for Prefills
7.2 Portability and Performance for Decodes
7.3 Efficacy of Physical Memory Allocation
7.4 Analysis of Memory Fragmentation
5.2 Leveraging Low-level CUDA Support
The standard GPU memory allocation interface cudaMalloc does not support demand paging, i.e., it allocates virtual memory and physical memory at the same time. However, recent CUDA versions provide programmers fine-grained control over virtual and physical memory [17, 35]. We leverage these low-level APIs in vAttention.
5.2.1 CUDA virtual memory APIs. Table 3 provides a high-level overview of the CUDA APIs that allow separating the allocation of virtual memory from physical memory (see the leftmost column). The allocation granularity depends on the page size used by the GPU, and the size of a virtual memory buffer or a physical memory handle must be a multiple of the allocation granularity. Different sub-regions of a virtual memory buffer can be backed by physical memory independently of other sub-regions in that buffer (see Figure 7c for an example). For simplicity, we refer to the granularity at which physical memory is allocated as the page size.
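The sketch below illustrates how these driver APIs decouple virtual and physical allocation; the function names are from CUDA's virtual memory management API, but the span size (64 pages) and the overall flow are illustrative assumptions, not vAttention's actual allocator.

```cpp
// Minimal sketch (not vAttention's allocator): reserve a virtual address
// range up front and back only its first page with physical memory, using
// the CUDA driver's virtual memory management APIs.
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Physical memory lives on this device; allocation sizes must be a
    // multiple of the granularity reported here (the "page size" above).
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t page_size = 0;
    cuMemGetAllocationGranularity(&page_size, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    // Reserve virtual address space for the whole buffer; no physical
    // memory is committed yet (hypothetical size: 64 pages).
    size_t span = 64 * page_size;
    CUdeviceptr base = 0;
    cuMemAddressReserve(&base, span, 0, 0, 0);

    // Back just the first page with physical memory and make it accessible;
    // other sub-regions of the buffer can be backed later, independently.
    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, page_size, &prop, 0);
    cuMemMap(base, page_size, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(base, page_size, &access, 1);

    printf("reserved %zu bytes at %p, mapped first %zu bytes\n",
           span, (void*)base, page_size);
    return 0;
}
```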
5.2.2 Extending PyTorch caching allocator: The KV-cache is a collection of tensors. In current deep learning frameworks such as PyTorch, a tensor allocated via APIs such as torch.empty comes with pre-allocated physical memory, because the PyTorch caching allocator uses the cudaMalloc interface to allocate GPU memory (both virtual and physical). Relying on the low-level API support from CUDA, we extend the PyTorch caching allocator to allow an application to reserve a virtual memory buffer for a tensor without committing physical memory ahead of time. We refer to tensors allocated via these APIs as virtual tensors.
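To make the notion of a virtual tensor concrete, here is a minimal sketch: it wraps a reserved but unbacked device address range in a tensor via libtorch's torch::from_blob. The reserve_virtual helper, the half-precision layout, and the tensor shape are assumptions for illustration; vAttention's actual extension of the caching allocator may expose a different interface.

```cpp
#include <torch/torch.h>
#include <cuda.h>

// Sketch helper: reserve a device virtual address range without committing
// physical memory (granularity rounding and error checks omitted).
static void* reserve_virtual(size_t bytes) {
    CUdeviceptr base = 0;
    cuMemAddressReserve(&base, bytes, 0, 0, 0);
    return reinterpret_cast<void*>(base);
}

// A "virtual tensor": its storage points at reserved virtual memory only.
// Physical pages are mapped later, on demand; touching an unmapped region
// before it is backed would fault, so the serving framework maps pages
// before the attention kernel reads or writes them.
torch::Tensor make_virtual_kv_tensor(int64_t max_batch, int64_t max_tokens,
                                     int64_t num_heads, int64_t head_dim) {
    int64_t numel = max_batch * max_tokens * num_heads * head_dim;
    void* base = reserve_virtual(numel * sizeof(at::Half));

    auto opts = torch::TensorOptions()
                    .dtype(torch::kFloat16)
                    .device(torch::kCUDA);
    // No copy and no physical allocation: the tensor is a view over the
    // reserved virtual range.
    return torch::from_blob(base,
                            {max_batch, max_tokens, num_heads, head_dim},
                            opts);
}
```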
5.2.3 Request-level KV-cache indexing: Note that each virtual tensor represents the K-cache (or V-cache) of a layer for the maximum batch size B. In these tensors, different requests occupy different non-overlapping sub-regions (say sub-tensors). We locate the sub-tensor of a request with a unique integer identifier reqId that lies in the range of 0 to B − 1 (note that at most B requests run simultaneously). The K-cache (or V-cache) offset of a request's sub-tensor in the virtual tensor of the entire batch is reqId × S, where S is the maximum K-cache (or V-cache) size of a request on a worker. The request identifier reqId is allocated by vAttention.
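The indexing scheme reduces to a single multiplication, as the small sketch below shows. The symbols B and S follow the text; the struct and method names are illustrative assumptions rather than vAttention's actual code.

```cpp
#include <cassert>
#include <cstddef>

// Per-request indexing into a layer's batch-wide virtual K-cache
// (or V-cache) tensor.
struct KVCacheIndex {
    std::size_t S;  // maximum per-request K-cache (or V-cache) size (bytes)
    std::size_t B;  // maximum batch size

    // Byte offset of a request's sub-tensor inside the virtual tensor:
    // requests occupy non-overlapping, contiguous regions of size S each.
    std::size_t sub_tensor_offset(std::size_t reqId) const {
        assert(reqId < B);  // at most B requests run simultaneously
        return reqId * S;
    }
};
```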
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Ramya Prabhu, Microsoft Research India;
(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);
(3) Jayashree Mohan, Microsoft Research India;
(4) Ramachandran Ramjee, Microsoft Research India;
(5) Ashish Panwar, Microsoft Research India.