Abstract and 1 Introduction

2 Background

2.1 Large Language Models

2.2 Fragmentation and PagedAttention

3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel

3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead

4 Insights into LLM Serving Systems

5 vAttention: System Design and 5.1 Design Overview

5.2 Leveraging Low-level CUDA Support

5.3 Serving LLMs with vAttention

6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation

6.2 Hiding memory allocation latency

7 Evaluation

7.1 Portability and Performance for Prefills

7.2 Portability and Performance for Decodes

7.3 Efficacy of Physical Memory Allocation

7.4 Analysis of Memory Fragmentation

8 Related Work

9 Conclusion and References

7 Evaluation

Our evaluation seeks to answer the following questions:

• How does vAttention perform for the prefill and decode phases of LLM inference? What are the portability and performance advantages of vAttention?

• How efficiently can vAttention allocate GPU memory for LLM serving workloads, and how effectively can it deal with KV-cache fragmentation?

Models and Hardware: We evaluate three models: Yi-6B, Llama-3-8B, and Yi-34B, using a single NVIDIA A100 GPU for Yi-6B and two NVLink-connected A100 GPUs for Llama-3-8B and Yi-34B (see Table 5). Each GPU has 80GB of physical memory. We use a tensor-parallelism degree of two (TP-2) for both Llama-3-8B and Yi-34B. All three models use grouped-query attention (GQA), which is the most commonly used attention mechanism in recent LLMs.

Evaluation methodology: The computation and memory allocation patterns of the prefill and decode phases are substantially different, and the attention kernels used for the two phases also differ; we therefore evaluate them separately. The prefill phase requires a one-time memory allocation, potentially spanning multiple pages, whereas the decode phase requires incremental memory allocation over the lifetime of a request. We measure the throughput of these phases in terms of tokens processed (or generated) per second, as sketched below.
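The sketch below illustrates how these two throughput metrics can be measured; it is a minimal example, not the paper's benchmarking harness. The `engine` object and its `prefill()` and `decode_step()` methods are hypothetical placeholders for whatever serving framework is under test.

```python
import time

def measure_prefill_throughput(engine, prompts):
    """Prefill: the full prompt is processed at once, so KV-cache memory for
    the whole prompt is allocated up front (potentially spanning multiple pages)."""
    total_tokens = sum(len(p) for p in prompts)
    start = time.perf_counter()
    for prompt in prompts:
        engine.prefill(prompt)  # hypothetical API: processes all prompt tokens in one pass
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed  # prefill tokens processed per second

def measure_decode_throughput(engine, num_steps):
    """Decode: one token is generated per request per step, so KV-cache memory
    grows incrementally over the lifetime of each request."""
    generated = 0
    start = time.perf_counter()
    for _ in range(num_steps):
        generated += engine.decode_step()  # hypothetical API: returns tokens emitted this step
    elapsed = time.perf_counter() - start
    return generated / elapsed  # decode tokens generated per second
```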

This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:

(1) Ramya Prabhu, Microsoft Research India;

(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);

(3) Jayashree Mohan, Microsoft Research India;

(4) Ramachandran Ramjee, Microsoft Research India;

(5) Ashish Panwar, Microsoft Research India.