Abstract and 1. Introduction

  2. Related Work and Background

  3. Analysis

    3.1 Limitations of Existing ReLUfication

    3.2 dReLU

  4. Are Neurons in Experts Still Sparsely Activated?

  5. dReLU Sparsification

  6. Experiment Results

    6.1 Downstream Task Performance

    6.2 Sparsity of Sparsified Models

  7. Practical Inference Speedup Evaluation

    7.1 Experiment Settings

    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

    7.4 Deploying LLMs on Mobile Phones

  8. Conclusion and References

A. Appendix / Supplemental Material

B. Limitation

C. Broader Impact

7 Practical Inference Speedup Evaluation

In this section, we evaluate the practical generation speedup our models achieve. During the SFT phase, we incorporate a predictor module for each FFN block; notably, for TurboSparse-Mixtral-47B, we train a predictor for each expert. When an expert is routed to, its neuron-level predictor identifies which neurons will be activated, enabling neuron-level sparse computation.
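The predictor-guided sparse FFN described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the low-rank predictor weights (`P1`, `P2`), the threshold, and the gated-FFN weight names are all illustrative assumptions; the only element taken from the paper is that dReLU applies ReLU to both the gate and up branches, and that only predicted-active neuron rows are computed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, rank = 64, 256, 8

# Gated FFN weights (gate/up/down); shapes follow common LLM FFN blocks.
W_gate = rng.standard_normal((d_ff, d_model)) * 0.1
W_up = rng.standard_normal((d_ff, d_model)) * 0.1
W_down = rng.standard_normal((d_model, d_ff)) * 0.1

# Hypothetical low-rank predictor that approximates which neurons
# survive dReLU for a given input.
P1 = rng.standard_normal((rank, d_model)) * 0.1
P2 = rng.standard_normal((d_ff, rank)) * 0.1


def predict_active(x, threshold=0.0):
    """Return indices of FFN neurons predicted to be active for input x."""
    score = P2 @ (P1 @ x)
    return np.flatnonzero(score > threshold)


def sparse_ffn(x, active):
    """Compute the FFN touching only the predicted-active neuron rows."""
    g = np.maximum(W_gate[active] @ x, 0.0)  # dReLU on the gate branch
    u = np.maximum(W_up[active] @ x, 0.0)    # dReLU on the up branch
    return W_down[:, active] @ (g * u)


x = rng.standard_normal(d_model)
y = sparse_ffn(x, predict_active(x))
```

If the predictor is accurate, the skipped rows would have produced zeros after dReLU anyway, so the sparse output matches the dense one while loading only a fraction of the FFN weights; this is what makes offloading-based frameworks like PowerInfer fast.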

We integrate our two models with PowerInfer, a state-of-the-art inference framework for sparsely activated models, to evaluate the actual generation speed.

Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.


This paper is available on arxiv under CC BY 4.0 license.