
Optimizing LLM Learning: Multi-Token Cross-Entropy Loss Explained

Written by @cosmological | Published on 2025/7/18

TL;DR
Explore the core of our approach: a generalized cross-entropy loss that enables LLMs to predict multiple future tokens simultaneously, fundamentally changing the learning process.

Abstract and 1. Introduction

2. Method

3. Experiments on real data

4. Ablations on synthetic data

5. Why does it work? Some speculation

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

2. Method

Standard language modeling learns about a large text corpus x_1, …, x_T by implementing a next-token prediction task. Formally, the learning objective is to minimize the cross-entropy loss
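The displayed formula did not survive extraction; in the paper's notation, with x_{t:1} denoting the prefix x_t, …, x_1, the standard next-token objective takes the usual form

$$
L_1 = -\sum_{t} \log P_\theta\!\left(x_{t+1} \mid x_{t:1}\right),
$$

i.e. the sum over positions of the negative log-likelihood the model assigns to the next token given its prefix.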

In this work, we generalize the above by implementing a multi-token prediction task, where at each position of the training corpus, the model is instructed to predict n future tokens at once. This translates into the cross-entropy loss
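Here too the displayed equation was dropped in extraction; a reconstruction consistent with the surrounding description (n future tokens predicted in parallel from the same prefix, one output head per offset) is

$$
L_n = -\sum_{t} \log P_\theta\!\left(x_{t+n:t+1} \mid x_{t:1}\right)
    = -\sum_{t} \sum_{i=1}^{n} \log P_\theta\!\left(x_{t+i} \mid x_{t:1}\right),
$$

which reduces to the standard loss above for n = 1 and otherwise adds one cross-entropy term per prediction head.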

Figure 2: Order of the forward/backward in an n-token prediction model with n = 2 heads. By performing the forward/backward on the heads in sequential order, we avoid materializing all unembedding layer gradients in memory simultaneously and reduce peak GPU memory usage.
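A minimal PyTorch sketch of this scheduling idea (not the authors' implementation; trunk, heads, and unembed are hypothetical modules standing in for the shared transformer body, one lightweight layer per head, and a shared unembedding projection):

```python
import torch
import torch.nn.functional as F

def multi_token_step(trunk, heads, unembed, tokens, n):
    # tokens has shape (B, T + n): the trunk sees the first T positions and
    # head i predicts the token i + 1 steps ahead of every position.
    B, T = tokens.shape[0], tokens.shape[1] - n

    # Single forward pass through the shared trunk, kept in the autograd graph.
    z = trunk(tokens[:, :T])                              # (B, T, d)

    # Detached copy that accumulates gradients from every head, so each
    # head's backward stops at the trunk output instead of re-traversing it.
    z_in = z.detach().requires_grad_(True)

    total = 0.0
    for i, head in enumerate(heads):
        logits = unembed(head(z_in))                      # (B, T, V)
        targets = tokens[:, 1 + i : T + 1 + i]            # i + 1 tokens ahead
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        # Immediate backward frees this head's logits and their gradients
        # before the next head runs, keeping peak memory near the 1-head cost.
        loss.backward()
        total += loss.item()

    # One backward pass through the trunk with the accumulated head gradients.
    z.backward(z_in.grad)
    return total
```

In a full training loop an optimizer step would follow; the point of the sketch is only that each head's logits and unembedding gradients are alive one at a time rather than all n at once.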

Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech and Equal contribution;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay and Equal contribution;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and last author;

(5) Gabriel Synnaeve, FAIR at Meta and last author.


This paper is available on arXiv under a CC BY 4.0 DEED license.

[story continues]


Written by @cosmological
From the Big Bang's singularity to galaxies' cosmic dance, the universe unfolds its majestic tapestry of space and time.

Topics and tags: cross-entropy-loss | multi-token-prediction | llm-optimization | language-modeling | ai-method | deep-learning | transformer-models | llm-training