Table of Links
- Analysis
- Experiment Results
- Practical Inference Speedup Evaluation
- A. Appendix / Supplemental Material
7 Practical Inference Speedup Evaluation
In this section, we evaluate the practical acceleration achieved in model generation. During the SFT phase, we incorporate a predictor module for each FFN block. Notably, for TurboSparse-Mixtral-47B, we train a predictor for each expert. When an expert is routed to, its neuron-level predictor identifies which neurons will be activated, enabling neuron-level sparse computation.
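The idea of predictor-gated sparse FFN computation can be sketched as follows. This is a minimal illustration with made-up shapes and a simple low-rank predictor; the names (`P1`, `P2`, `W_up`, `W_down`) and the thresholding rule are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ffn, d_pred = 64, 256, 32

# FFN weights (gating branch omitted for brevity).
W_up = rng.standard_normal((d_model, d_ffn)) * 0.1
W_down = rng.standard_normal((d_ffn, d_model)) * 0.1

# Hypothetical low-rank predictor: scores each FFN neuron cheaply.
P1 = rng.standard_normal((d_model, d_pred)) * 0.1
P2 = rng.standard_normal((d_pred, d_ffn)) * 0.1

def relu(x):
    return np.maximum(x, 0.0)

def sparse_ffn(x, threshold=0.0):
    # 1. Predictor estimates which neurons will be activated.
    scores = relu(x @ P1) @ P2
    active = scores > threshold          # boolean mask over d_ffn neurons
    # 2. Compute only the predicted-active rows/columns of the FFN.
    h = relu(x @ W_up[:, active])
    return h @ W_down[active, :], int(active.sum())

x = rng.standard_normal(d_model)
y, n_active = sparse_ffn(x)
print(y.shape, n_active, "of", d_ffn, "neurons computed")
```

Because only the predicted-active slices of `W_up` and `W_down` are touched, the FFN cost scales with the number of active neurons rather than with `d_ffn`, which is the source of the speedup when activation sparsity is high.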
We integrate our two models with PowerInfer, a state-of-the-art sparsity-aware inference framework, to evaluate actual generation speed.
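Generation speed is typically reported in tokens per second. A generic timing harness for a decode loop might look like the following; the `generate_step` callable is a stand-in for whatever backend produces one token (this is not PowerInfer's actual API).

```python
import time

def measure_decode_speed(generate_step, n_tokens=128):
    """Time n_tokens sequential decode steps and return tokens/second.

    generate_step: a zero-argument callable that produces one token;
    any inference backend could be plugged in here (illustrative only).
    """
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_step()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Dummy step standing in for real model decoding.
speed = measure_decode_speed(lambda: sum(range(1000)))
print(f"{speed:.1f} tokens/s")
```

Measuring sequential decode steps like this captures the latency-bound regime where activation sparsity matters most, since each token's FFN work is on the critical path.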
Authors:
(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;
(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(5) Li Ma, Shanghai Artificial Intelligence Laboratory;
(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);
(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.