Abstract and 1 Introduction

2 Background

2.1 Large Language Models

2.2 Fragmentation and PagedAttention

3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel

3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead

4 Insights into LLM Serving Systems

5 vAttention: System Design and 5.1 Design Overview

5.2 Leveraging Low-level CUDA Support

5.3 Serving LLMs with vAttention

6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation

6.2 Hiding memory allocation latency

7 Evaluation

7.1 Portability and Performance for Prefills

7.2 Portability and Performance for Decodes

7.3 Efficacy of Physical Memory Allocation

7.4 Analysis of Memory Fragmentation

8 Related Work

9 Conclusion and References

7 Evaluation

Our evaluation seeks to answer the following questions:

• How does vAttention perform for the prefill and decode phases of LLM inference? What are the portability and performance advantages of vAttention?

• How efficiently can vAttention allocate GPU memory for LLM serving workloads, and how effectively can it deal with KV-cache fragmentation?

Models and Hardware: We evaluate three models: Yi-6B, Llama-3-8B, and Yi-34B, using a single NVIDIA A100 GPU for Yi-6B and two NVLink-connected A100 GPUs for Llama-3-8B and Yi-34B (see Table 5). Each GPU has 80GB of physical memory. We use a tensor-parallelism degree of two (TP-2) for both Llama-3-8B and Yi-34B. All three models use grouped-query attention (GQA), which is the most commonly used attention mechanism in recent LLMs.

Evaluation methodology: The computation and memory allocation patterns of the prefill and decode phases are substantially different, and the attention kernels used for the two phases also differ; we therefore evaluate them separately. The prefill phase requires a one-time memory allocation, potentially spanning multiple pages, whereas the decode phase requires incremental memory allocation over the lifetime of a request. We measure the throughput of these phases in terms of tokens processed (or generated) per second, as sketched below.
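The sketch below illustrates how these two throughput metrics can be measured; it is a minimal example, not the paper's benchmarking harness. The `engine` object and its `prefill()` and `decode_step()` methods are hypothetical placeholders for whatever serving framework is under test.

```python
import time

def measure_prefill_throughput(engine, prompts):
    """Prefill: the full prompt is processed at once, so KV-cache memory for
    the whole prompt is allocated up front (potentially spanning multiple pages)."""
    total_tokens = sum(len(p) for p in prompts)
    start = time.perf_counter()
    for prompt in prompts:
        engine.prefill(prompt)  # hypothetical API: processes all prompt tokens in one pass
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed  # prefill tokens processed per second

def measure_decode_throughput(engine, num_steps):
    """Decode: one token is generated per request per step, so KV-cache memory
    grows incrementally over the lifetime of each request."""
    generated = 0
    start = time.perf_counter()
    for _ in range(num_steps):
        generated += engine.decode_step()  # hypothetical API: returns tokens emitted this step
    elapsed = time.perf_counter() - start
    return generated / elapsed  # decode tokens generated per second
```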

This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:

(1) Ramya Prabhu, Microsoft Research India;

(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);

(3) Jayashree Mohan, Microsoft Research India;

(4) Ramachandran Ramjee, Microsoft Research India;

(5) Ashish Panwar, Microsoft Research India.