This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: 1R_WWQSRpBTGWbFZtEYjLh5JLeHI-CJ_NyaNU7987No

Unleashing LLM Speed: Multi-Token Self-Speculative Decoding Redefines Inference

Written by @cosmological | Published on 2025/7/21

TL;DR
See multi-token prediction in action: benchmark charts and tables show that self-speculative decoding delivers significant relative speedups, with throughput gains that grow as inference scales with batch size.
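To make the idea concrete, here is a minimal toy sketch of the speculative-decoding loop: extra heads draft several tokens ahead, and the full model verifies the draft in one pass, accepting the longest correct prefix. All functions here are hypothetical stand-ins (a simple `+1` token rule with injected draft errors), not the article's actual models.

```python
# Toy sketch of self-speculative decoding. The "models" below are
# hypothetical stand-ins, not the paper's implementation.

def target_next(seq):
    # Stand-in for the full model's next-token rule: next = last + 1.
    return seq[-1] + 1

def draft_k(seq, k):
    # Stand-in for the multi-token prediction heads: guess k tokens
    # ahead with the same +1 rule, but inject an occasional error.
    guesses = []
    cur = seq[-1]
    for _ in range(k):
        cur = cur + 1
        if cur % 5 == 0:
            guesses.append(cur + 7)  # deliberate wrong draft token
        else:
            guesses.append(cur)
    return guesses

def speculative_decode(seq, n_new, k=4):
    """Generate n_new tokens; each round verifies a k-token draft."""
    seq = list(seq)
    passes = 0
    while len(seq) < n_new + 1:
        draft = draft_k(seq, k)
        passes += 1  # one target-model verification pass per draft
        for tok in draft:
            correct = target_next(seq)
            if tok == correct:
                seq.append(tok)      # accept matching draft token
            else:
                seq.append(correct)  # replace mismatch, end the round
                break
            if len(seq) >= n_new + 1:
                break
    return seq, passes

out, passes = speculative_decode([1], n_new=12, k=4)
print(out, passes)
```

With this toy rule, 12 tokens are produced in 4 verification passes instead of 12 sequential model calls, which is the source of the relative speedups the article measures.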

[story continues]


Written by
@cosmological
From the Big Bang's singularity to the galaxies' cosmic dance, the universe unfolds its majestic tapestry of space and time.

Topics and tags
llm-acceleration|multi-token-prediction|inference-speedup|self-speculative-decoding|latency-reduction|code-models|natural-language-processing|multi-head-prediction