Table of Links
2 Background
2.2 Fragmentation and PagedAttention
3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel
3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead
4 Insights into LLM Serving Systems
5 vAttention: System Design and 5.1 Design Overview
5.2 Leveraging Low-level CUDA Support
5.3 Serving LLMs with vAttention
6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation
6.2 Hiding memory allocation latency
7.1 Portability and Performance for Prefills
7.2 Portability and Performance for Decodes
7.3 Efficacy of Physical Memory Allocation
7.4 Analysis of Memory Fragmentation
2.1 Large Language Models
Given an input sequence, an LLM predicts the probability of an output sequence, where a sequence is an ordered list of tokens [39]. Each inference request begins with a prefill phase that processes all of its prompt tokens in parallel. The prefill phase produces the first output token of a request. Thereafter, the decode phase runs iteratively: each iteration processes the output token generated in the previous step and produces the next output token [26].
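To make the two phases concrete, the following minimal sketch shows the control flow of a single request; `toy_model` is a stand-in for a real LLM forward pass and is not part of the paper.

```python
# Minimal sketch of the prefill/decode loop described above.
# `toy_model` is a hypothetical stand-in for an LLM forward pass;
# it only exists to make the control flow concrete.
import random

def toy_model(tokens):
    # Pretend forward pass: returns a "next token" for the given context.
    return random.randint(0, 99)

def generate(prompt_tokens, max_new_tokens, eos_token=0):
    # Prefill: all prompt tokens are processed in parallel (one model call),
    # producing the first output token of the request.
    next_token = toy_model(prompt_tokens)
    output = [next_token]

    # Decode: each iteration consumes the token produced in the previous
    # step and emits the next output token.
    while len(output) < max_new_tokens and next_token != eos_token:
        next_token = toy_model(prompt_tokens + output)
        output.append(next_token)
    return output

print(generate(prompt_tokens=[11, 42, 7], max_new_tokens=8))
```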
LLMs are built atop a variant of the transformer architecture [44]. A transformer block contains two types of operators: position-wise and sequence-wise. The former category includes the feed-forward network, layer normalization, activation, embedding layer, output sampling layer, and residual connections, whereas attention is a sequence-wise operator. In this paper, we primarily focus on attention since it is the primary consumer of GPU memory in LLM inference.
Structure of the KV-cache and terminology: An LLM consists of multiple layers of the transformer block, and each layer maintains its own cache of keys and values. In this paper, we refer to the cache of all transformer blocks collectively as the KV-cache, while using the terms K-cache and V-cache for keys and values, respectively. In deep learning frameworks, the K-cache (or V-cache) at each layer is typically represented as a 4D tensor of shape [𝐵, 𝐿, 𝐻, 𝐷], where 𝐵 refers to the batch size, 𝐿 to the maximum possible context length of a request, 𝐻 to the number of KV heads, and 𝐷 to the dimension of each KV head. We refer to a kernel implementation that computes attention scores over contiguously stored K and V as a vanilla kernel.
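The sketch below illustrates this layout in PyTorch: one contiguous [𝐵, 𝐿, 𝐻, 𝐷] K-cache and V-cache tensor per layer, and a toy "vanilla" attention computed over a contiguous slice of them. The tensor sizes and the attention function are illustrative assumptions, not taken from the paper.

```python
# Sketch (illustrative sizes): per-layer K/V caches laid out as the
# contiguous 4D tensors described above. Real deployments use much
# larger L (max context length) and many more layers.
import torch

B, L, H, D = 2, 128, 4, 16   # batch size, max context length, KV heads, head dim
num_layers = 2

# One contiguous K-cache and V-cache tensor of shape [B, L, H, D] per layer.
k_cache = [torch.zeros(B, L, H, D) for _ in range(num_layers)]
v_cache = [torch.zeros(B, L, H, D) for _ in range(num_layers)]

def vanilla_attention(q, k, v):
    """Toy attention over contiguously stored K and V for one layer.

    q:    [B, H, 1, D] query for the current decode token
    k, v: [B, T, H, D] contiguous cache slices (T = tokens cached so far)
    """
    k = k.transpose(1, 2)                        # [B, H, T, D]
    v = v.transpose(1, 2)                        # [B, H, T, D]
    scores = q @ k.transpose(-2, -1) / D ** 0.5  # [B, H, 1, T]
    return torch.softmax(scores, dim=-1) @ v     # [B, H, 1, D]

# Example: attend over the first 16 cached tokens of layer 0.
q = torch.randn(B, H, 1, D)
out = vanilla_attention(q, k_cache[0][:, :16], v_cache[0][:, :16])
print(out.shape)  # torch.Size([2, 4, 1, 16])
```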
This paper is available on arXiv under the CC BY 4.0 DEED license.
[2] An exception to this is sliding window attention, which fixes an upper bound on how many of the prior tokens are used in computing attention.
Authors:
(1) Ramya Prabhu, Microsoft Research India;
(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);
(3) Jayashree Mohan, Microsoft Research India;
(4) Ramachandran Ramjee, Microsoft Research India;
(5) Ashish Panwar, Microsoft Research India.