Table of Links

- Analysis
- Experiments Results
- Practical Inference Speedup Evaluation
- A. Appendix / supplemental material
6.2 Sparsity of Sparsified Models
In this subsection, we report our models' sparsity. We first profile the proportion of zero-valued activations in every layer on a general dataset (FineWeb), as shown in Figure 6. Counting activations with a value of zero, we find that TurboSparse-Mistral-7B has, on average, 90% of its neurons inactive in each layer. For TurboSparse-Mixtral-47B, this percentage is slightly lower, at 85% on average for each expert FFN. Originally, Mixtral-47B activates 2 out of 8 experts in each layer, which already introduces 75% sparsity, meaning only 25% of the FLOPs need to be computed. Furthermore, after ReLUfication, each expert activates only 15% of its neurons. Combining these two factors, only about 3% of the parameters in each MoE layer are activated during inference.
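The combined-sparsity claim above follows from multiplying the two independent activation fractions. A minimal Python sketch of that arithmetic (illustrative only, not the authors' code; the numbers are those stated in the text):

```python
# Fractions reported in the text (assumed averages, not exact per-layer values)
experts_total = 8
experts_active = 2           # Mixtral routes each token to 2 of 8 experts
neuron_active_frac = 0.15    # after ReLUfication, ~15% of each expert's neurons fire

# Expert-level routing alone leaves this fraction of expert parameters active
expert_fraction = experts_active / experts_total            # 0.25

# Neuron-level sparsity multiplies on top of expert routing
active_param_fraction = expert_fraction * neuron_active_frac  # 0.0375

print(f"expert-level sparsity: {1 - expert_fraction:.0%}")        # 75%
print(f"active params per MoE layer: {active_param_fraction:.2%}")  # 3.75%
```

The product is about 3.75%, which the text rounds to roughly 3% of parameters activated per MoE layer.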
Authors:
(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;
(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(5) Li Ma, Shanghai Artificial Intelligence Laboratory;
(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);
(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.