Table of Links
2 Background
2.2 Fragmentation and PagedAttention
3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel
3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead
4 Insights into LLM Serving Systems
5 vAttention: System Design and 5.1 Design Overview
5.2 Leveraging Low-level CUDA Support
5.3 Serving LLMs with vAttention
6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation
6.2 Hiding memory allocation latency
7.1 Portability and Performance for Prefills
7.2 Portability and Performance for Decodes
7.3 Efficacy of Physical Memory Allocation
7.4 Analysis of Memory Fragmentation
3.2 Adds redundancy in the serving framework
PagedAttention makes an LLM serving system responsible for managing the mappings between the KV-cache and dynamically allocated memory blocks. For example, consider a request that allocates four KV-cache blocks over time (left half of Figure 2). These blocks are usually non-contiguous in virtual memory. During the computation of Equation 2, the PagedAttention kernel needs to access all the elements of the four KV-cache blocks. To facilitate this, the serving system needs to track the virtual memory addresses of the KV-cache blocks and pass them to the attention kernel at runtime. This approach effectively duplicates what the operating system already does to enable virtual-to-physical address translation (right half of Figure 2).
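To make the duplication concrete, the sketch below shows the kind of per-request block mapping a PagedAttention-style serving system has to maintain and consult on every access. This is an illustrative assumption, not vLLM's actual code; names such as `BLOCK_SIZE`, `KVCacheManager`, and `lookup` are hypothetical.

```python
# Minimal sketch (assumed names, not vLLM's implementation) of the block mapping
# a PagedAttention-based serving system maintains for each request.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (illustrative)

class KVCacheManager:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # request id -> list of block ids

    def append_token(self, request_id, token_index):
        """Allocate a new block whenever a request crosses a block boundary."""
        table = self.block_tables.setdefault(request_id, [])
        if token_index % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())

    def lookup(self, request_id, token_index):
        """Translate a token position into (block id, offset) -- the same kind of
        address translation the OS page table already performs."""
        table = self.block_tables[request_id]
        return table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE
```

The `lookup` routine mirrors a page-table walk: the framework re-implements in user space the translation that virtual memory hardware and the OS already provide.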
3.3 Performance Overhead
PagedAttention also leads to potential performance issues on both the GPU and the CPU. We investigate them separately below.
3.3.1 Runtime overhead on the GPU. PagedAttention slows down attention computation by adding extra code to the critical path. For example, vLLM acknowledges that its PagedAttention-based implementation was 20-26% slower than the original FasterTransformer kernel, primarily due to the overhead of looking up Block-Tables and executing extra branches [39]. Figure 3 shows that the paged decode kernel in FlashAttention is also slower than the vanilla kernel. Our further analysis reveals that PagedAttention executes up to 13% more instructions than the vanilla kernel. We also find that the overhead of paging diminishes at large batch sizes or long context lengths: attention computation for decodes is memory-bound, and when the KV-cache is large, memory stalls hide the instruction overhead.
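The source of the extra instructions can be seen in the following simplified sketch (assumed shapes and names, not the actual kernel code): a vanilla kernel reads the KV-cache with a single contiguous slice, whereas a paged kernel performs a Block-Table lookup and offset computation for every position it touches.

```python
import torch

def gather_keys_vanilla(k_cache, seq_len):
    # k_cache: [max_seq_len, num_heads, head_dim], contiguous per request.
    return k_cache[:seq_len]  # one contiguous slice, no indirection

def gather_keys_paged(k_blocks, block_table, seq_len, block_size=16):
    # k_blocks: [num_blocks, block_size, num_heads, head_dim] (shared pool)
    # block_table: physical block ids for this request
    keys = []
    for pos in range(seq_len):
        block = block_table[pos // block_size]   # extra Block-Table lookup
        offset = pos % block_size                # extra index arithmetic / branches
        keys.append(k_blocks[block, offset])
    return torch.stack(keys)
```

In an actual GPU kernel this indirection shows up as additional loads, integer arithmetic, and branches in the inner loop, which is consistent with the instruction-count increase reported above.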
To highlight another example of the difficulty involved in writing an efficient attention kernel, Figure 5 shows that the performance of vLLM's paged decode kernel is significantly worse with larger block sizes of 64 and 128. Our analysis indicates that this is likely due to L1 cache efficiency: smaller blocks achieve higher memory bandwidth utilization because of higher L1 cache hit rates.
3.3.2 Runtime overhead on the CPU. Implementing an additional memory manager can also introduce performance issues in the CPU runtime of the serving system. We refer to a few real-world examples, along with our own observations on vLLM, to corroborate this argument.
To enable PagedAttention, a serving system needs to supply Block-Tables to the attention kernel. In vLLM, the latency of preparing a Block-Table depends on batch composition and grows proportional to max_num_blocks × batch_size, where max_num_blocks refers to the number of KV-cache blocks in the longest request of the batch. This is because vLLM manages the Block-Table as a 2D tensor and aligns the number of KV-cache blocks across requests by padding unoccupied slots with zeros. If a batch contains a few long and many short requests, such padding results in a significant overhead. In our earlier experiments, we observed that Block-Table preparation in vLLM contributed 30% of decode iteration latency. While a recent fix [19] has mitigated some of this overhead, we find that it can still be as high as 10%. A high CPU overhead of PagedAttention has also been reported for TensorRT-LLM, degrading throughput by 11% (from 412 to 365 tokens per second) [14]. The issue was attributed to the Python runtime of TensorRT-LLM; moving to a C++ runtime can mitigate this overhead.
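The padding cost is easy to see in a rough sketch of the 2D Block-Table preparation described above (the helper name and shapes are assumptions, not vLLM's exact code): every row is padded to the width of the longest request, so the work grows with max_num_blocks × batch_size even if most requests are short.

```python
import torch

def prepare_block_table(block_tables_per_request):
    """Pack per-request block-id lists into a zero-padded 2D tensor."""
    batch_size = len(block_tables_per_request)
    max_num_blocks = max(len(t) for t in block_tables_per_request)
    table = torch.zeros(batch_size, max_num_blocks, dtype=torch.int32)
    for i, blocks in enumerate(block_tables_per_request):
        table[i, : len(blocks)] = torch.tensor(blocks, dtype=torch.int32)
    return table

# Example: one 2048-token request (128 blocks of 16 tokens) batched with many
# 16-token requests (1 block each) produces a 128-column table that is mostly zeros.
```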
Overall, this section shows that the PagedAttention model adds a significant programming burden while also being inefficient. vAttention introduces a more systematic approach to dynamic KV-cache memory management by leveraging the existing system support for demand paging. However, before delving into vAttention, we first highlight some of the fundamental characteristics of LLM serving workloads in terms of memory management.
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Ramya Prabhu, Microsoft Research India;
(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);
(3) Jayashree Mohan, Microsoft Research India;
(4) Ramachandran Ramjee, Microsoft Research India;
(5) Ashish Panwar, Microsoft Research India.