Table of Links
2 Background
2.2 Fragmentation and PagedAttention
3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel
3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead
4 Insights into LLM Serving Systems
5 vAttention: System Design and 5.1 Design Overview
5.2 Leveraging Low-level CUDA Support
5.3 Serving LLMs with vAttention
6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation
6.2 Hiding memory allocation latency
7.1 Portability and Performance for Prefills
7.2 Portability and Performance for Decodes
7.3 Efficacy of Physical Memory Allocation
7.4 Analysis of Memory Fragmentation
6 vAttention: Optimizations
There are two primary challenges in using CUDA’s virtual memory support for serving LLMs. First, cuMemCreate currently allocates physical memory at a minimum granularity of 2MB. Such large pages can waste physical memory due to internal fragmentation. Second, invoking CUDA APIs incurs high latency. This section details a set of simple yet effective optimizations that we introduce to overcome these limitations.
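To make the first challenge concrete, the following minimal sketch (ours, not from the paper) uses the standard CUDA driver APIs to query the allocation granularity and map one physical page; on current GPUs, cuMemGetAllocationGranularity reports 2MB, so even a request for a few kilobytes of KV-cache memory consumes a full 2MB page. Error handling is omitted for brevity.

```c
// Sketch: allocating physical GPU memory with CUDA's virtual memory APIs.
// All driver calls here (cuMemGetAllocationGranularity, cuMemCreate,
// cuMemAddressReserve, cuMemMap, cuMemSetAccess) are real; error
// handling is omitted for brevity.
#include <cuda.h>
#include <stdio.h>

int main(void) {
    cuInit(0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, /*device=*/0);

    CUmemAllocationProp prop = {0};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = 0;

    // Challenge 1: the minimum physical allocation granularity is 2MB,
    // so every cuMemCreate request is rounded up to a multiple of 2MB.
    size_t gran;
    cuMemGetAllocationGranularity(&gran, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    printf("minimum granularity: %zu bytes\n", gran);  // 2097152 (2MB)

    // Reserve virtual address space, create a physical handle, and map it.
    CUdeviceptr va;
    cuMemAddressReserve(&va, gran, 0, 0, 0);
    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, gran, &prop, 0);   // allocates one 2MB physical page
    cuMemMap(va, gran, 0, handle, 0);

    CUmemAccessDesc access = {0};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(va, gran, &access, 1);

    // Challenge 2: each of these driver calls has non-trivial latency,
    // which matters if they are issued on the serving critical path.

    cuMemUnmap(va, gran);
    cuMemRelease(handle);
    cuMemAddressFree(va, gran);
    cuCtxDestroy(ctx);
    return 0;
}
```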
6.1 Mitigating internal fragmentation
We mitigate internal fragmentation by reducing the granularity of physical memory allocation. NVIDIA GPUs natively support at least three page sizes: 4KB, 64KB and 2MB. Therefore, in principle, physical memory can be allocated in any multiple of 4KB. The simplest way to achieve this would be to extend the existing CUDA virtual memory APIs (listed in Table 3) to also support allocating smaller pages (similar to how mmap in Linux supports multiple page sizes). Unfortunately, the CUDA APIs are implemented in NVIDIA’s closed-source drivers, which makes it impossible for us to modify their implementation.
Fortunately, part of the NVIDIA driver stack (in particular, the code related to unified memory management) is open-source. We therefore implement a new set of APIs in the open-source NVIDIA drivers that mimic the functionality of the existing CUDA APIs but support multiple page sizes. The second column of Table 3 shows our new APIs: most have a one-to-one correspondence with existing CUDA APIs, except that, for simplicity, vMemMap combines the functionality of cuMemMap and cuMemSetAccess, and vMemRelease combines the functionality of cuMemUnmap and cuMemRelease. In contrast to the CUDA APIs, our APIs can allocate memory in 64KB, 128KB and 256KB page sizes. A serving framework can configure the desired page size when initializing vAttention; we recommend 256KB pages by default. The last set of columns in Table 3 shows the latency of each API with different page sizes.
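Table 3 is not reproduced here, but based on the description above, the new driver interface might be declared roughly as follows. This is a hypothetical sketch: the names other than vMemMap and vMemRelease, and all signatures, are our assumptions rather than the paper's exact interface.

```c
// Hypothetical sketch of vAttention's driver-level APIs (only vMemMap and
// vMemRelease are named in the text; everything else here is an assumed
// illustration; see Table 3 of the paper for the actual interface).
// Unlike the CUDA APIs, these accept a configurable page size.
typedef enum {
    V_PAGE_64KB  = 64  * 1024,
    V_PAGE_128KB = 128 * 1024,
    V_PAGE_256KB = 256 * 1024,   // recommended default
} vPageSize;

// Counterpart of cuMemCreate: allocate one physical page of the given size.
int vMemCreate(unsigned long long *handle, vPageSize page_size, int device);

// Combines cuMemMap + cuMemSetAccess: map the physical page at a virtual
// address and make it accessible in a single driver call.
int vMemMap(void *virt_addr, unsigned long long handle, vPageSize page_size);

// Combines cuMemUnmap + cuMemRelease: unmap and free the physical page.
int vMemRelease(void *virt_addr, unsigned long long handle, vPageSize page_size);
```

Folding the map/access and unmap/release pairs into single calls keeps the interface simple and also reduces the number of driver calls needed per mapping.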
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Ramya Prabhu, Microsoft Research India;
(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);
(3) Jayashree Mohan, Microsoft Research India;
(4) Ramachandran Ramjee, Microsoft Research India;
(5) Ashish Panwar, Microsoft Research India.