Abstract and 1. Introduction

  2. Related Work and Background

  3. Analysis

    3.1 Limitations of Existing ReLUfication

    3.2 dReLU

  4. Are Neurons in Experts Still Sparsely Activated?

  5. dReLU Sparsification

  6. Experimental Results

    6.1 Downstream Task Performance

    6.2 Sparsity of Sparsified Models

  7. Practical Inference Speedup Evaluation

    7.1 Experiment Settings

    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

    7.4 Deploying LLMs on Mobile Phones

  8. Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact

6.2 Sparsity of Sparsified Models

In this subsection, we report our models’ sparsity. We first profile the proportion of zero-valued activations in every layer on a general dataset (FineWeb), as shown in Figure 6. Counting activations that are exactly zero, we find that TurboSparse-Mistral-7B has, on average, 90% of the neurons in each layer inactive. For TurboSparse-Mixtral-47B, this percentage is slightly lower, at 85% on average for each expert FFN. Originally, Mixtral-47B activates 2 out of 8 experts in each layer, which already introduces 75% sparsity, meaning only 25% of FLOPs need to be computed. After ReLUfication, each active expert additionally activates only 15% of its neurons. Combining the two, only about 3% of the parameters in each MoE layer are activated during inference.
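The combined-sparsity figure above follows from multiplying the two activation fractions. A minimal sketch of that arithmetic, using only the numbers stated in the text (variable names are illustrative):

```python
# Sketch of the combined-sparsity arithmetic for TurboSparse-Mixtral-47B.
# All ratios come from the text above; nothing here is measured.

experts_total = 8
experts_active = 2          # Mixtral routes each token to 2 of its 8 experts

# Sparsity from MoE routing alone: fraction of expert FFNs skipped per token
routing_sparsity = 1 - experts_active / experts_total    # 0.75, i.e. 25% of FLOPs remain

# After ReLUfication, each active expert fires only ~15% of its neurons
neuron_activation = 0.15

# Fraction of MoE-layer parameters touched per token:
# (share of experts used) x (share of neurons fired within each used expert)
activated_fraction = (experts_active / experts_total) * neuron_activation

print(f"routing sparsity:   {routing_sparsity:.0%}")
print(f"activated fraction: {activated_fraction:.2%}")   # ~3%, matching the text
```

The exact product is 0.25 × 0.15 = 3.75%, which the text rounds to roughly 3% of MoE-layer parameters per token.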

Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.


This paper is available on arXiv under a CC BY 4.0 license.