Table of Links

- Analysis
- Experiments Results
- Practical Inference Speedup Evaluation
- A. Appendix / Supplemental Material

B. Limitation
Our models have undergone continued training on only 150B tokens. Compared to the 15T tokens used to pre-train Llama-3 [60], this limited amount of training data still leaves some deficiencies in the model's capabilities. We are optimistic that further training can help mitigate these shortcomings.
Authors:
(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;
(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(5) Li Ma, Shanghai Artificial Intelligence Laboratory;
(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);
(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.