Table of Links
2 Background
2.2 Fragmentation and PagedAttention
3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel
3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead
4 Insights into LLM Serving Systems
5 vAttention: System Design and 5.1 Design Overview
5.2 Leveraging Low-level CUDA Support
5.3 Serving LLMs with vAttention
6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation
6.2 Hiding memory allocation latency
7.1 Portability and Performance for Prefills
7.2 Portability and Performance for Decodes
7.3 Efficacy of Physical Memory Allocation
7.4 Analysis of Memory Fragmentation
5.3 Serving LLMs with vAttention
We build vAttention as a Python library that internally uses a CUDA/C++ extension for interacting with CUDA drivers. Our library exposes a set of simple APIs listed in Table 4 to the serving framework.
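As a rough illustration of that interface, the Python stub below sketches what the four calls used in the rest of this section could look like. Only the function names (init, alloc_reqid, step, free_reqid) come from the text; every parameter, type, and return value here is an assumption, not the library's actual signature.

```python
# Hypothetical sketch of the vAttention interface described in this section.
# Only the function names come from the text; parameters, types, and return
# values are illustrative assumptions, not the library's actual signatures.
from typing import Dict

def init(num_layers: int, num_kv_heads: int, head_dim: int, elem_bytes: int,
         max_batch_size: int, page_size: int) -> None:
    """Configure per-worker model parameters (N', H, D, P, B) and a preferred
    page size; reserve 2 x N' virtual KV-cache tensors and pre-allocate
    physical memory pages."""
    ...

def alloc_reqid() -> int:
    """Return a fresh reqId that tags all later memory operations of a request."""
    ...

def step(context_lengths: Dict[int, int]) -> bool:
    """Ensure every active reqId's KV-cache sub-tensors are physically backed.

    context_lengths maps reqId -> current context length (0 for inactive
    reqIds). Returns False when the memory demand cannot be satisfied.
    """
    ...

def free_reqid(req_id: int) -> None:
    """Mark a request as completed; its pages may be unmapped now or later."""
    ...
```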
5.3.1 Initial setup: When the serving framework starts, each model worker loads the vAttention library and configures it with the model parameters N′, H, D, P, B and a preferred page size via the init API. Internally, vAttention reserves 2 × N′ virtual tensors (using our modified PyTorch caching allocator) for the KV-cache at each worker. These virtual tensors are reserved for the lifetime of the serving application. In addition, vAttention also pre-allocates physical memory pages during initialization. However, these pages are not mapped into the KV-cache yet.
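To give a sense of the scale of this reservation, the sketch below estimates the virtual address space reserved per worker, assuming each of the 2 × N′ per-layer virtual tensors is sized for a maximum batch size and context length; the concrete numbers and the max_context_len input are illustrative assumptions, not values from the paper.

```python
# Rough estimate of per-worker virtual address space reserved at init time.
# Assumes each of the 2 x N' per-layer virtual tensors is sized for the
# maximum batch size and context length; all numbers below are illustrative.
n_layers = 32            # N': layers handled by this worker
kv_heads = 32            # H
head_dim = 128           # D
elem_bytes = 2           # P: bytes per element (FP16/BF16)
max_batch = 128          # B
max_context_len = 4096   # assumed maximum context length of the model

per_layer_bytes = max_batch * max_context_len * kv_heads * head_dim * elem_bytes
total_virtual_bytes = 2 * n_layers * per_layer_bytes   # K- and V-cache tensors
print(total_virtual_bytes / 2**30, "GiB of virtual address space")  # ~256 GiB
```

Only virtual address space of this size is reserved up front; physical pages are mapped into it on demand as requests grow.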
5.3.2 Scheduling a new request: When a new request is scheduled for the first time, the serving framework obtains a new reqId from vAttention via alloc_reqid. All subsequent memory management operations of the request are tagged with this reqId.
5.3.3 Model execution: Before scheduling a batch for execution, the framework needs to ensure that the KV-cache sub-tensors of each active request are backed by physical memory. For this purpose, before dispatching the first kernel of an iteration to the GPU, the framework invokes the step API, specifying the current context length of each request (the context length is set to 0 for each inactive reqId). Internally, vAttention ensures that enough physical pages are mapped for each active reqId before returning execution back to the framework. If vAttention cannot satisfy the memory demand, it returns a failure, in response to which the serving framework can preempt one or more requests to allow forward progress (this is similar to vLLM's default behavior). We leave more sophisticated policies, such as swapping out the KV-cache to CPU memory, as future work.
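The following is a minimal sketch of how a serving framework might drive these calls each iteration, assuming the stub interface sketched above; the scheduler object and its methods (newly_scheduled, all_reqids, active, preempt_one, execute_model) are hypothetical placeholders, and the preemption-on-failure loop only mirrors the vLLM-like behavior described above.

```python
# Hypothetical per-iteration driver, assuming the stub interface sketched
# earlier. The scheduler object and its methods are illustrative placeholders.
def run_iteration(scheduler):
    # Newly scheduled requests obtain a reqId before their first iteration.
    for req in scheduler.newly_scheduled():
        req.req_id = alloc_reqid()

    # Report the current context length of every reqId (0 for inactive ones),
    # so vAttention can map enough physical pages before kernels are launched.
    context_lengths = {rid: 0 for rid in scheduler.all_reqids()}
    for req in scheduler.active():
        context_lengths[req.req_id] = req.context_length

    # If vAttention cannot satisfy the demand, preempt requests until it can
    # (similar to vLLM's default behavior described in the text).
    while not step(context_lengths):
        victim = scheduler.preempt_one()
        context_lengths[victim.req_id] = 0

    scheduler.execute_model()  # dispatch the iteration's GPU kernels
```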
Depending on whether a request is in the prefill or decode phase, a different number of physical memory pages may need to be mapped for a given iteration. The prefill phase processes the input tokens of the given prompt in parallel and populates one slot in the K-cache (and V-cache) of the request at each layer of the model. Therefore, the number of pages that need to be mapped depends on the number of prompt tokens being scheduled. If the total K-cache size of all prompt tokens at one layer of the model is s and the page size is t, then each worker needs to ensure that at least (s + t - 1)/t, i.e., ⌈s/t⌉, physical memory pages are mapped in each of the 2 × N′ KV-cache sub-tensors of the given reqId.
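As a concrete illustration of the (s + t - 1)/t calculation, the snippet below plugs in assumed values: a 1500-token prompt, 8 KV heads, head dimension 128, FP16 elements, and 2MB pages. None of these numbers comes from the paper.

```python
# Worked example of the prefill page count; all numbers here are assumed.
prompt_tokens = 1500
kv_heads, head_dim, elem_bytes = 8, 128, 2      # H, D, P (FP16)
page_size = 2 * 1024 * 1024                     # t: 2MB pages

s = prompt_tokens * kv_heads * head_dim * elem_bytes  # K-cache bytes, one layer
t = page_size
pages = (s + t - 1) // t                        # ceil(s / t)
print(s, pages)  # 3072000 bytes -> 2 pages per KV-cache sub-tensor
```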
For a request in the decode phase, the number of new pages required is at most one per request. This is because each iteration produces only one output token for a request. vAttention internally tracks the number of pages mapped for each request and maps a new page only when the last page allocated to that request is fully utilized.
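The decision of when a decoding request needs a new page can be captured with a few lines of bookkeeping. The sketch below is a minimal illustration under assumed names (tokens_per_page, pages_mapped), not the library's actual logic.

```python
# Minimal sketch of the decode-time rule: map at most one new page per request,
# and only when the last mapped page is already full. Names are hypothetical.
def new_pages_for_decode(context_length: int, pages_mapped: int,
                         tokens_per_page: int) -> int:
    capacity = pages_mapped * tokens_per_page  # tokens the mapped pages can hold
    # The next iteration appends exactly one token to this request's KV-cache.
    return 1 if context_length + 1 > capacity else 0
```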
5.3.4 Request completion: A request terminates when a user-specified context length or the maximum context length supported by the model is reached, or when the model produces a special end-of-sequence token. The framework notifies vAttention of a request's completion with free_reqid. Internally, vAttention may unmap the pages of a completed request immediately or defer freeing them until later.
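A hypothetical sketch of how a framework might wire up this completion path is shown below; only free_reqid comes from the text, while the request fields and termination checks are illustrative assumptions.

```python
# Hypothetical completion hook in the serving framework; the request fields
# and thresholds are placeholders, only free_reqid comes from the text.
def maybe_release(request, eos_token_id: int, model_max_len: int) -> bool:
    finished = (request.last_token == eos_token_id
                or request.context_length >= request.user_max_len
                or request.context_length >= model_max_len)
    if finished:
        # vAttention may unmap the request's pages now, or defer freeing them.
        free_reqid(request.req_id)
    return finished
```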
This paper is available on arXiv under the CC BY 4.0 DEED license.
Authors:
(1) Ramya Prabhu, Microsoft Research India;
(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);
(3) Jayashree Mohan, Microsoft Research India;
(4) Ramachandran Ramjee, Microsoft Research India;
(5) Ashish Panwar, Microsoft Research India.