Achieving Sparse Activation in Small Language Models

2406.06562

Published 6/12/2024 by Jifeng Song, Kai Huang, Xiangyu Yin, Boyuan Yang, Wei Gao

Achieving Sparse Activation in Small Language Models

Abstract

Sparse activation, which selectively activates only an input-dependent set of neurons in inference, is a useful technique to reduce the computing cost of Large Language Models (LLMs) without retraining or adaptation efforts. However, whether it can be applied to the recently emerging Small Language Models (SLMs) remains questionable, because SLMs are generally less over-parameterized than LLMs. In this paper, we aim to achieve sparse activation in SLMs. We first show that the existing sparse activation schemes in LLMs that build on neurons' output magnitudes cannot be applied to SLMs, and activating neurons based on their attribution scores is a better alternative. Further, we demonstrated and quantified the large errors of existing attribution metrics when being used for sparse activation, due to the interdependency among attribution scores of neurons across different layers. Based on these observations, we proposed a new attribution metric that can provably correct such errors and achieve precise sparse activation. Experiments over multiple popular SLMs and datasets show that our approach can achieve 80% sparsification ratio with <5% model accuracy loss, comparable to the sparse activation achieved in LLMs. The source code is available at: https://github.com/pittisl/Sparse-Activation.

Create account to get full access

Overview

This paper explores techniques for achieving sparse activation in small language models, which can lead to more efficient and compact models.
The researchers investigate different methods to encourage sparsity, including ProSparse, Learn to be Efficient, Turbo Sparse, and Sparsity Accelerated Training.
They also explore Contextually Aware Thresholding for Sparsity (CATS) to enable sparse activation patterns that are tuned to the specific input.

Plain English Explanation

The paper focuses on making small language models more efficient and compact by encouraging "sparse activation." This means that instead of all the neurons in the model firing at once, only a small subset of them are active at any given time. This can lead to significant reductions in the model size and computational requirements.

The researchers test out different techniques to promote this sparse activation. One method, called ProSparse, works to increase the natural sparsity that already exists in the model. Another, Learn to be Efficient, actually trains the model to build in a structured sparsity pattern. Turbo Sparse aims to achieve state-of-the-art performance with minimal model size, while Sparsity Accelerated Training speeds up the training process for large language models.

The paper also explores Contextually Aware Thresholding for Sparsity (CATS), which allows the model to adjust its sparsity pattern based on the specific input it's processing. This lets the model focus its limited computational resources on the most relevant parts of the input.

By using these various techniques, the researchers were able to create small, efficient language models that maintain high performance despite having far fewer active neurons than a traditional model.

Technical Explanation

The paper investigates several methods for encouraging sparse activation in small language models. ProSparse aims to increase the natural sparsity that arises during training by adding a sparsity-inducing regularizer. Learn to be Efficient takes a more direct approach, training the model to learn a structured sparsity pattern.

Turbo Sparse combines several techniques to achieve state-of-the-art performance with a minimal model size, while Sparsity Accelerated Training speeds up the training process for large language models by encouraging sparsity.

The paper also introduces Contextually Aware Thresholding for Sparsity (CATS), which allows the model to dynamically adjust its sparsity pattern based on the input. This enables the model to focus its limited computational resources on the most relevant parts of the input.

The researchers evaluate these techniques on a range of language modeling tasks, demonstrating significant reductions in model size and computational requirements while maintaining high performance.

Critical Analysis

The paper presents a comprehensive exploration of techniques for achieving sparse activation in small language models. The researchers thoroughly evaluate the various methods and provide insights into the tradeoffs and considerations for each approach.

One potential limitation is the focus on small language models, which may limit the applicability of the findings to larger, more complex models. Additionally, the paper does not delve deeply into the underlying mechanisms or theoretical justifications for the observed sparsity patterns.

While the results are promising, further research is needed to fully understand the implications of these sparse activation techniques and their broader impact on language model design and deployment. It would be valuable to explore how these methods scale to different model sizes and tasks, as well as to investigate any potential downstream effects on model robustness, interpretability, or fairness.

Conclusion

This paper makes a significant contribution to the field of efficient language model design by exploring various techniques for achieving sparse activation in small language models. The researchers demonstrate that by encouraging sparsity, it is possible to create compact, computationally efficient models without sacrificing performance.

The findings have important implications for the development of energy-efficient and resource-constrained language models, which are crucial for applications in edge computing, mobile devices, and resource-limited environments. The techniques presented in this paper pave the way for more sustainable and accessible natural language processing capabilities in a wide range of real-world scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun

Activation sparsity refers to the existence of considerable weakly-contributed elements among activation outputs. As a prevalent property of the models using the ReLU activation function, activation sparsity has been proven a promising paradigm to boost model inference efficiency. Nevertheless, most large language models (LLMs) adopt activation functions without intrinsic activation sparsity (e.g., GELU and Swish). Some recent efforts have explored introducing ReLU or its variants as the substitutive activation function to help LLMs achieve activation sparsity and inference acceleration, but few can simultaneously obtain high sparsity and comparable model performance. This paper introduces a simple and effective sparsification method named ProSparse to push LLMs for higher activation sparsity while maintaining comparable performance. Specifically, after substituting the activation function of LLMs with ReLU, ProSparse adopts progressive sparsity regularization with a factor smoothly increasing along the multi-stage sine curves. This can enhance activation sparsity and mitigate performance degradation by avoiding radical shifts in activation distributions. With ProSparse, we obtain high sparsity of 89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size MiniCPM-1B, respectively, achieving comparable performance to their original Swish-activated versions. These present the most sparsely activated models among open-source LLaMA versions and competitive end-size models, considerably surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference acceleration experiments further demonstrate the significant practical acceleration potential of LLMs with higher activation sparsity, obtaining up to 4.52$times$ inference speedup.

5/28/2024

cs.LG cs.AI cs.CL

Learn To be Efficient: Build Structured Sparsity in Large Language Models

Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash

Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads. The emergence of activation sparsity in LLMs provides a natural approach to reduce this cost by involving only parts of the parameters for inference. However, existing methods only focus on utilizing this naturally formed activation sparsity in a post-training setting, overlooking the potential for further amplifying this inherent sparsity. In this paper, we hypothesize that LLMs can learn to be efficient by achieving more structured activation sparsity. To achieve this, we introduce a novel training algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs to learn to activate fewer neurons and achieve a better trade-off between sparsity and performance. Furthermore, unlike SOTA MoEfication methods, which mainly focus on ReLU-based models, LTE can also be applied to LLMs like LLaMA using non-ReLU activations. Extensive evaluation on language understanding, language generation, and instruction tuning tasks show that LTE consistently outperforms SOTA baselines. Along with our hardware-aware custom kernel implementation, LTE reduces LLaMA2-7B inference latency by 25% at 50% sparsity.

6/5/2024

cs.CL cs.AI cs.LG

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at url{https://huggingface.co/PowerInfer}

6/12/2024

cs.LG cs.CL

Sparsity-Accelerated Training for Large Language Models

Da Ma, Lu Chen, Pengyu Wang, Hongshen Xu, Hanqi Li, Liangtai Sun, Su Zhu, Shuai Fan, Kai Yu

Large language models (LLMs) have demonstrated proficiency across various natural language processing (NLP) tasks but often require additional training, such as continual pre-training and supervised fine-tuning. However, the costs associated with this, primarily due to their large parameter count, remain high. This paper proposes leveraging emph{sparsity} in pre-trained LLMs to expedite this training process. By observing sparsity in activated neurons during forward iterations, we identify the potential for computational speed-ups by excluding inactive neurons. We address associated challenges by extending existing neuron importance evaluation metrics and introducing a ladder omission rate scheduler. Our experiments on Llama-2 demonstrate that Sparsity-Accelerated Training (SAT) achieves comparable or superior performance to standard training while significantly accelerating the process. Specifically, SAT achieves a $45%$ throughput improvement in continual pre-training and saves $38%$ training time in supervised fine-tuning in practice. It offers a simple, hardware-agnostic, and easily deployable framework for additional LLM training. Our code is available at https://github.com/OpenDFM/SAT.

6/7/2024

cs.CL