Abstract and 1. Introduction

  2. Related Work and Background

  3. Analysis

    3.1 Limitations of Existing ReLUfication

    3.2 dReLU

  4. Are Neurons in Experts still Sparsely Activated?

  5. dReLU Sparsification

  6. Experiments Results

    6.1 Downstream Tasks Performance

    6.2 Sparsity of Sparsified Models

  7. Practical Inference Speedup Evaluation

    7.1 Experiments Setting

    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

    7.4 Deploying LLMs on Mobile Phones

  8. Conclusion and References

A. Appendix / Supplemental Material

B. Limitation

C. Broader Impact

Abstract

Exploiting activation sparsity is a promising approach to significantly accelerating the inference of large language models (LLMs) without compromising performance. However, activation sparsity is determined by the activation function, and commonly used ones such as SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity, and inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. Applying our neuron sparsification method to the Mistral and Mixtral models activates only 2.5 billion and 4.3 billion parameters per inference iteration, respectively, while achieving even stronger model performance. Evaluation results demonstrate that this sparsity yields a 2-5× decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at https://huggingface.co/PowerInfer.

1 Introduction

Large Language Models (LLMs) have achieved remarkable results, demonstrating emergent natural language abilities as the number of model parameters scales [9, 67]. These models have pushed the state-of-the-art performance across a wide range of downstream applications, such as QA and coding. However, most LLMs, such as Llama [60], Mistral [24], and Gemma [58], utilize all of their parameters during inference. These are known as dense models. The escalating demand for computational resources by dense models has become a significant barrier to the development of powerful and accessible AI, given the substantial costs involved.

To address the efficiency issues inherent in dense models, conditional computation [7, 6] has emerged as a crucial approach, which refers to activating only part of the neurons in a network. There are two primary methods to achieve conditional computation. Mixture-of-Experts (MoE) [17, 31] is the first promising method, which introduces conditional computation by manually setting constraints on the model architecture prior to training, such as fixing the number of experts to activate. Through a process known as expert routing, this technique selectively activates specific parts of the model in response to particular inputs, resulting in significant efficiency improvements. For instance, Switch Transformer [17] scales the model to the trillion-parameter level without significantly increasing computational FLOPs. The second promising method exploits the sparse activation that emerges naturally from the ReLU activation function [33], which outputs exact zeros that contribute nothing to the computation. This activation sparsity presents a significant opportunity for efficient inference: Deja Vu [36] exploits the ReLU-induced sparsity in dense models to achieve 2× speedups, and PowerInfer [56] achieves up to 11× speedups when deploying larger LLMs on a single consumer-grade GPU.
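The activation sparsity exploited above can be illustrated with a toy measurement: ReLU zeroes every negative pre-activation, and the fraction of exact zeros is precisely what conditional computation can skip. A minimal NumPy sketch (shapes, names, and the synthetic input are illustrative, not from the paper):

```python
import numpy as np

def relu(x):
    # ReLU outputs exact zeros for all negative inputs,
    # so those neurons contribute nothing downstream.
    return np.maximum(x, 0.0)

def activation_sparsity(pre_activation):
    # Fraction of neurons whose output is exactly zero --
    # the corresponding rows of the down-projection can be skipped.
    act = relu(pre_activation)
    return float(np.mean(act == 0.0))

rng = np.random.default_rng(0)
# Toy pre-activations: zero-mean, so roughly half fall below zero.
h = rng.standard_normal((4, 11008))  # (tokens, FFN hidden size)
print(f"sparsity: {activation_sparsity(h):.2f}")  # ~0.50 for a symmetric distribution
```

In a trained ReLU model the distribution is skewed negative, which is why measured sparsity is far higher than this symmetric baseline.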

Recent LLMs typically prefer activation functions such as GELU [23] and Swish [50]. However, these functions do not significantly promote activation sparsity and are challenging to accelerate with conditional computation. To address this, ReLUfication [42], an existing state-of-the-art method, replaces the original activation function with ReLU and continues with pretraining. Despite its potential, this approach often struggles to achieve the desired levels of activation sparsity and may risk performance degradation [30, 59].
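ReLUfication, as described, swaps the smooth activation for ReLU in the gated FFN and then continues pretraining. A hedged NumPy sketch of the swap (`swiglu_ffn`/`reglu_ffn` are illustrative names and random weights, not the method's actual code):

```python
import numpy as np

def silu(x):
    # Swish/SiLU: smooth, almost never exactly zero.
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Standard SwiGLU block: silu(x W_gate) * (x W_up) W_down
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def reglu_ffn(x, w_gate, w_up, w_down):
    # ReLUfication: only the gate activation changes to ReLU,
    # so roughly the negative half of gate pre-activations becomes zero.
    return (np.maximum(x @ w_gate, 0.0) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, h = 64, 256
x = rng.standard_normal((2, d))
w_gate = rng.standard_normal((d, h)) / d**0.5
w_up = rng.standard_normal((d, h)) / d**0.5
w_down = rng.standard_normal((h, d)) / h**0.5

gate = np.maximum(x @ w_gate, 0.0) * (x @ w_up)
print("ReGLU hidden sparsity:", np.mean(gate == 0.0))  # ~0.5 at random init
```

With random weights the cut is only about 50%; the ~70% figure quoted below for ReGLU emerges after continued pretraining shifts the gate distribution.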

We argue that the failure of existing ReLUfication methods can be attributed to two main reasons. First, simply substituting SwiGLU with ReGLU is inefficient, as it only increases sparsity from 40% to around 70%; this suggests that a deeper investigation into the model architecture is necessary to reach higher levels of sparsity. Second, the limited diversity of pretraining data and the insufficient number of training tokens in current approaches lead to incomplete capability recovery [42, 30]. As a result, expanding the diversity of pretraining datasets and increasing the number of training tokens are critical steps toward enhancing model performance.

To address these challenges, we first conduct a comprehensive analysis of the existing ReLUfication approach and identify that its shortcomings stem from the negative activations in the GLU component. Therefore, we propose an efficient activation function named dReLU. We apply dReLU in the pretraining of small-scale LLMs, alongside SwiGLU, and our findings indicate that LLMs using dReLU match the performance of those using SwiGLU, while also achieving close to 90% sparsity. Additionally, we collect a diverse range of pretraining corpora from the open-source community, including web, code, and mathematical datasets, to enhance the effectiveness of ReLUfication.
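The precise dReLU equation appears in Section 3.2; one plausible reading of "negative activations in the GLU component" is that ReLU is applied to both the gate and the up branch, so a hidden element survives only when both pre-activations are positive. A NumPy sketch of that assumed formulation (our assumption here, not a verbatim reproduction of the paper's definition):

```python
import numpy as np

def drelu_hidden(x, w_gate, w_up):
    # Assumed dReLU form: ReLU on BOTH branches of the GLU.
    # An element is nonzero only when the gate AND up pre-activations
    # are both positive, compounding the sparsity of each branch.
    return np.maximum(x @ w_gate, 0.0) * np.maximum(x @ w_up, 0.0)

rng = np.random.default_rng(0)
d, h = 64, 4096
x = rng.standard_normal((8, d))
w_gate = rng.standard_normal((d, h)) / d**0.5
w_up = rng.standard_normal((d, h)) / d**0.5

hidden = drelu_hidden(x, w_gate, w_up)
# Two roughly independent ~50% cuts compound to ~75% zeros at random init;
# the ~90% reported in the text emerges only after continued pretraining.
print("dReLU hidden sparsity:", np.mean(hidden == 0.0))
```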

Meanwhile, we also conduct a sparsity analysis on MoE-based LLMs. Interestingly, we observe that the feed-forward networks (FFNs) within the experts remain sparsely activated, similar to the behavior exhibited by dense LLMs. This phenomenon suggests an opportunity to further accelerate inference speed by combining MoE techniques with ReLU-based sparse activation.

To validate the effectiveness of our proposed method, we implement it on the Mistral-7B and Mixtral-47B models, converting them into TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B, respectively. Extensive experiments across a wide range of downstream tasks demonstrate (Figure 1) that our enhanced models not only match but often surpass the performance of their original counterparts.

Remarkably, in the TurboSparse-Mistral-7B model, we increase the average sparsity of the FFN to 90% while enhancing model capabilities. In MoE models, we further improve the sparsity of TurboSparse-Mixtral-47B, which originally stems from expert routing, from 75% to 97% by incorporating sparse neuron activations. This substantial increase in sparsity significantly reduces FLOPs during inference.
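The jump from 75% to 97% follows from compounding the two sparsity sources. A quick check, assuming Mixtral's top-2-of-8 expert routing (75% of expert parameters idle) and roughly 90% neuron-level sparsity inside each active expert (the figures come from the text; the multiplication is ours):

```python
# Routing alone: 2 of 8 experts active per token.
routing_active = 2 / 8             # 0.25 -> 75% sparsity from routing

# Neuron-level sparsity inside the active experts (~90% per the text).
neuron_active = 1 - 0.90           # 0.10

# Combined fraction of expert-FFN parameters actually used per token.
combined_active = routing_active * neuron_active
combined_sparsity = 1 - combined_active
print(f"combined sparsity: {combined_sparsity:.1%}")  # 97.5%, matching the ~97% reported
```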

Finally, we integrate our two new models with PowerInfer to evaluate the inference speed. Performance evaluation reveals that our models deliver an average 2.83× generation speedup.

The key contributions of this paper include:

• Efficient dReLU activation function: Our sparsification method utilizes fewer than 150B tokens, representing less than 1% of the typical pretraining budget (commonly 15T tokens [11]).

• Sparsely activated models: We will release our sparsely-activated TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B models. Both demonstrate better performance than their original versions.

• Practical inference speedup: Evaluation shows that our models achieve a 2-5× speedup. Notably, TurboSparse-Mixtral-47B reaches up to 10 tokens/s even without a GPU.

This paper is available on arxiv under CC BY 4.0 license.

Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.