ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

Read original: arXiv:2402.13516 - Published 7/4/2024 by Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun

Overview

This paper introduces ProSparse, a technique to enhance the intrinsic activation sparsity within large language models (LLMs).
Activation sparsity refers to the proportion of zero-valued activations in the hidden layers of neural networks, which can be exploited to improve the efficiency of LLMs.
The authors demonstrate that ProSparse reaches high sparsity in foundational LLMs such as LLaMA2-7B (89.32%) and LLaMA2-13B (88.80%) while maintaining performance comparable to the original models.
The paper also discusses related techniques like CATS and dynamic activation and how ProSparse compares to these approaches.

Plain English Explanation

The paper focuses on a technique called ProSparse that can make large language models more efficient. Large language models are AI systems that can understand and generate human-like text, but they can also be very computationally intensive to run.

ProSparse builds on the fact that when many of a model's internal values (called "activations") are zero, the calculations that depend on them can be skipped. ProSparse trains the model so that a much larger share of its activations become zero, which lets the model run faster at inference time without significantly impacting its performance.

The authors show how ProSparse can be used to create highly sparse (i.e., with many zero-valued activations) versions of foundational language models like LLaMA2. Models that need less computation per token could in turn be deployed on a wider range of hardware, including less powerful devices.

The paper also compares ProSparse to other techniques, like CATS and dynamic activation, that also aim to make language models more efficient. It discusses how ProSparse differs from and builds upon these approaches.

Technical Explanation

The key contribution of this paper is the introduction of ProSparse, a novel technique for enhancing the intrinsic activation sparsity within large language models (LLMs). Activation sparsity refers to the proportion of zero-valued activations in the hidden layers of neural networks, which can be exploited to improve the efficiency of LLMs.
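To make the definition concrete, the following minimal sketch measures sparsity as the fraction of exactly-zero activations and compares ReLU with Swish (SiLU) on the same inputs. The input values and the helper names (`relu`, `silu`, `sparsity`) are illustrative assumptions, not taken from the paper.

```python
import math

def relu(x: float) -> float:
    # ReLU outputs exactly zero for every negative input.
    return max(0.0, x)

def silu(x: float) -> float:
    # Swish / SiLU: x * sigmoid(x). Negative inputs give small but nonzero outputs.
    return x / (1.0 + math.exp(-x))

def sparsity(activations: list[float]) -> float:
    """Fraction of activation values that are exactly zero."""
    if not activations:
        return 0.0
    return sum(1 for a in activations if a == 0.0) / len(activations)

# Hypothetical pre-activation values for one token (illustrative only).
pre_activations = [-2.0, -0.5, 0.4, 0.3, 1.2, -1.1, 0.7, -0.1]

relu_out = [relu(x) for x in pre_activations]
silu_out = [silu(x) for x in pre_activations]

print("ReLU sparsity:", sparsity(relu_out))
print("SiLU sparsity:", sparsity(silu_out))
```

In this toy input, half of the values are negative, so ReLU zeroes half of the outputs while SiLU zeroes none of them. This is why activation functions such as GELU and Swish are described as lacking intrinsic activation sparsity.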

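The authors demonstrate that ProSparse reaches high sparsity in foundational LLMs such as LLaMA2-7B and LLaMA2-13B while maintaining comparable performance. ProSparse first substitutes the model's original activation function (Swish in LLaMA2) with ReLU, then applies progressive sparsity regularization during training, with a regularization factor that increases smoothly along multi-stage sine curves. Raising the factor gradually avoids radical shifts in the activation distribution, which helps limit performance degradation.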
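The paper's abstract describes the regularization factor as "smoothly increasing along the multi-stage sine curves". One plausible reading of that description is sketched below: within each stage, the factor rises from the previous stage's value to the current stage's target along a half sine wave. The stage boundaries, the target factors, the exact curve shape, and the L1-style penalty are all assumptions made for illustration; they are not the authors' published settings.

```python
import math

def sine_ramp(progress: float) -> float:
    """Map progress in [0, 1] to [0, 1] along a half sine wave (flat at both ends)."""
    return 0.5 * (math.sin(math.pi * (progress - 0.5)) + 1.0)

def regularization_factor(step: int, stage_ends: list[int], stage_factors: list[float]) -> float:
    """Factor at a training step, rising stage by stage along sine curves.

    stage_ends[i] is the last step of stage i; stage_factors[i] is the value
    reached at that step. Both are hypothetical hyperparameters.
    """
    start_step, start_factor = 0, 0.0
    for end_step, end_factor in zip(stage_ends, stage_factors):
        if step <= end_step:
            progress = (step - start_step) / (end_step - start_step)
            return start_factor + (end_factor - start_factor) * sine_ramp(progress)
        start_step, start_factor = end_step, end_factor
    return stage_factors[-1]  # hold the final value after the last stage

def l1_penalty(activations: list[float], factor: float) -> float:
    """Assumed sparsity penalty: factor times the mean absolute activation."""
    return factor * sum(abs(a) for a in activations) / len(activations)

# Hypothetical three-stage schedule.
stage_ends = [100, 200, 300]
stage_factors = [1e-3, 5e-3, 1e-2]

for step in (0, 50, 100, 150, 300):
    print(step, regularization_factor(step, stage_ends, stage_factors))
```

During training, the penalty would be added to the language-modeling loss, so the pressure toward zero activations grows slowly rather than arriving all at once.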

The paper also discusses related techniques like CATS, which uses a context-aware thresholding mechanism to induce sparsity, and dynamic activation, which adaptively adjusts the activation function during inference. The authors compare and contrast these approaches, highlighting the unique benefits of ProSparse.

Critical Analysis

The paper provides a comprehensive and well-designed study on enhancing intrinsic activation sparsity in LLMs using the ProSparse technique. The authors thoroughly evaluate ProSparse's performance across a range of benchmark tasks and demonstrate its effectiveness in enabling high sparsity levels while maintaining model accuracy.

One area that deserves closer attention is the practical payoff of ProSparse-enabled models. The paper reports inference acceleration experiments with up to a 4.52× speedup, but it would be valuable to see these gains examined further for real-world deployment, including memory footprint and behavior on resource-constrained devices.
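To illustrate why higher activation sparsity can translate into less computation, the sketch below compares a dense feed-forward down-projection with one that skips neurons whose activation is exactly zero, counting multiply-accumulate operations in each case. The function names, sizes, and values are hypothetical, and the operation count is only a rough proxy: real speedups depend on the sparse kernels and hardware used.

```python
def dense_down_projection(activations, weights):
    """out[j] = sum_i activations[i] * weights[i][j], visiting every neuron."""
    out = [0.0] * len(weights[0])
    macs = 0
    for a, row in zip(activations, weights):
        for j, w in enumerate(row):
            out[j] += a * w
            macs += 1
    return out, macs

def sparse_down_projection(activations, weights):
    """Same result, but rows belonging to zero activations are never read."""
    out = [0.0] * len(weights[0])
    macs = 0
    for a, row in zip(activations, weights):
        if a == 0.0:
            continue  # a zero activation contributes nothing to the output
        for j, w in enumerate(row):
            out[j] += a * w
            macs += 1
    return out, macs

# Hypothetical layer: 10 intermediate neurons, 4 output dimensions, 80% sparsity.
activations = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0]
weights = [[float(i + j) for j in range(4)] for i in range(10)]

dense_out, dense_macs = dense_down_projection(activations, weights)
sparse_out, sparse_macs = sparse_down_projection(activations, weights)

print("dense MACs:", dense_macs)
print("sparse MACs:", sparse_macs)
print("outputs match:", dense_out == sparse_out)
```

With 8 of the 10 activations equal to zero, the sparse version touches only 2 weight rows, so it also reads fewer weights from memory, which is one reason sparsity can matter on constrained devices.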

Additionally, the paper could have delved deeper into the underlying mechanisms and theoretical foundations of ProSparse. A more in-depth discussion of how the ReLU substitution and the progressive regularization schedule interact to encourage sparse activations would strengthen the technical understanding of the approach.

Furthermore, the authors could have explored the potential trade-offs or limitations of ProSparse, such as whether the technique is universally applicable across different LLM architectures or if there are specific conditions or constraints under which it may be less effective. Evaluating the approach on a wider range of LLMs would also help to better understand the generalizability of ProSparse.

Overall, the paper presents a valuable contribution to the field of efficient LLM design, and the ProSparse technique shows promise as a method for improving the computational efficiency of large-scale language models. Further research and real-world application of this approach could yield significant benefits for the deployment of LLMs in resource-constrained environments.

Conclusion

The ProSparse paper introduces a technique for enhancing the intrinsic activation sparsity within large language models (LLMs). By substituting ReLU for the original activation function and applying progressive sparsity regularization, ProSparse reaches high sparsity levels in foundational LLMs such as LLaMA2 without significantly compromising model performance.

This work builds upon and advances existing approaches like CATS and dynamic activation, offering a promising solution for improving the efficiency of LLMs. As the demand for powerful language models continues to grow, techniques like ProSparse that enhance computational efficiency while preserving model capabilities will become increasingly important for enabling the widespread deployment of LLMs, including in resource-constrained environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun

Activation sparsity refers to the existence of considerable weakly-contributed elements among activation outputs. As a prevalent property of the models using the ReLU activation function, activation sparsity has been proven a promising paradigm to boost model inference efficiency. Nevertheless, most large language models (LLMs) adopt activation functions without intrinsic activation sparsity (e.g., GELU and Swish). Some recent efforts have explored introducing ReLU or its variants as the substitutive activation function to help LLMs achieve activation sparsity and inference acceleration, but few can simultaneously obtain high sparsity and comparable model performance. This paper introduces a simple and effective sparsification method named ProSparse to push LLMs for higher activation sparsity while maintaining comparable performance. Specifically, after substituting the activation function of LLMs with ReLU, ProSparse adopts progressive sparsity regularization with a factor smoothly increasing along the multi-stage sine curves. This can enhance activation sparsity and mitigate performance degradation by avoiding radical shifts in activation distributions. With ProSparse, we obtain high sparsity of 89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size MiniCPM-1B, respectively, achieving comparable performance to their original Swish-activated versions. These present the most sparsely activated models among open-source LLaMA versions and competitive end-size models, considerably surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference acceleration experiments further demonstrate the significant practical acceleration potential of LLMs with higher activation sparsity, obtaining up to 4.52$\times$ inference speedup.

7/4/2024

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at https://huggingface.co/PowerInfer.

6/12/2024

Training-Free Activation Sparsity in Large Language Models

James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun

Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory-movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53$\times$ and 1.8$\times$ at 40% and 50% model-wide sparsity. TEAL is compatible with weight quantization, enabling further efficiency gains.

8/28/2024

Achieving Sparse Activation in Small Language Models

Jifeng Song, Kai Huang, Xiangyu Yin, Boyuan Yang, Wei Gao

Sparse activation, which selectively activates only an input-dependent set of neurons in inference, is a useful technique to reduce the computing cost of Large Language Models (LLMs) without retraining or adaptation efforts. However, whether it can be applied to the recently emerging Small Language Models (SLMs) remains questionable, because SLMs are generally less over-parameterized than LLMs. In this paper, we aim to achieve sparse activation in SLMs. We first show that the existing sparse activation schemes in LLMs that build on neurons' output magnitudes cannot be applied to SLMs, and activating neurons based on their attribution scores is a better alternative. Further, we demonstrated and quantified the large errors of existing attribution metrics when being used for sparse activation, due to the interdependency among attribution scores of neurons across different layers. Based on these observations, we proposed a new attribution metric that can provably correct such errors and achieve precise sparse activation. Experiments over multiple popular SLMs and datasets show that our approach can achieve 80% sparsification ratio with <5% model accuracy loss, comparable to the sparse activation achieved in LLMs. The source code is available at: https://github.com/pittisl/Sparse-Activation.

6/12/2024