This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: ofqa4-KNmPT2NaDbZ1MPkPAghtY4M8YjOvOoOAqBmI8

Llama 2 Finetuning Results: Multi-Token Prediction on Coding Benchmarks

Written by @largemodels | Published on 2025/6/10

TL;DR
This table evaluates the impact of multi-token prediction on Llama 2 finetuning, showing that it does not significantly improve performance on coding benchmarks, with only modest gains such as on MBPP Pass@1.

Abstract and 1. Introduction

2. Method

3. Experiments on real data

3.1. Benefits scale with model size and 3.2. Faster inference

3.3. Learning global patterns with multi-byte prediction and 3.4. Searching for the optimal n

3.5. Training for multiple epochs and 3.6. Finetuning multi-token predictors

3.7. Multi-token prediction on natural language

4. Ablations on synthetic data and 4.1. Induction capability

4.2. Algorithmic reasoning

5. Why does it work? Some speculation and 5.1. Lookahead reinforces choice points

5.2. Information-theoretic argument

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements, and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

D. Finetuning

Table S6: Finetuning Llama 2 with multi-token prediction does not significantly improve performance. We tried to finetune Llama 2 with 4-token prediction, but this did not yield significant improvements over the baseline. We suppose that the new loss perturbs the pretrained initialization too abruptly and the model never fully recovers. We still see some improvements, for example on MBPP Pass@1. All runs use 200B tokens of code.
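
As a concrete illustration of the setup being finetuned here, below is a minimal sketch of a 4-token prediction loss: a shared decoder trunk produces hidden states, and n independent output heads each predict the token i positions ahead, with the per-head cross-entropies averaged. The class and tensor names (`MultiTokenHead`, `hidden`, `tokens`) are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of multi-token (n = 4) prediction heads on a decoder trunk.
# Illustrative only; names and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHead(nn.Module):
    """n parallel unembedding heads; head i predicts the token i steps ahead."""
    def __init__(self, d_model: int, vocab_size: int, n: int = 4):
        super().__init__()
        self.n = n
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size, bias=False) for _ in range(n)
        )

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) trunk outputs; tokens: (batch, seq).
        seq = tokens.size(1)
        total = hidden.new_zeros(())
        for i, head in enumerate(self.heads, start=1):
            # Head i can only be scored at positions 0 .. seq-1-i.
            logits = head(hidden[:, : seq - i])   # (batch, seq-i, vocab)
            targets = tokens[:, i:]               # (batch, seq-i)
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total / self.n
```

The Pass@1 numbers cited above are conventionally computed with the unbiased pass@k estimator of Chen et al. (2021); the helper below is general background for reading such tables, not the paper's evaluation code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of which pass all tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```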

This paper is available on arXiv under a CC BY 4.0 DEED license.

Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech, and contributed equally;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay, and contributed equally;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and a last author;

(5) Gabriel Synnaeve, FAIR at Meta and a last author.

Written by @largemodels
The Large-ness of Large Language Models (LLMs) ushered in a technological revolution. We dissect the research.

Topics and tags
multi-token-prediction|llama-2-finetuning|coding-benchmarks|mbpp|humaneval|llm-performance|transformer-finetuning|transformer-architecture