Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Read original: arXiv:2407.20584 - Published 8/27/2024 by Weiyu Huang, Yuezhou Hu, Guohao Jian, Jun Zhu, Jianfei Chen

Overview

This paper introduces a retraining-based pipeline for pruning large language models (LLMs) into semi-structured sparse models, reducing their memory footprint at deployment.
The pipeline, called Adaptive Sparse Trainer (AST), gradually converts a dense model into a sparse one by applying decay to masked weights while letting the model adaptively select its masks during training.
According to the abstract, AST narrows the zero-shot accuracy gap between dense and semi-structured sparse LLaMA2-7B models to 1.12% while using less than 0.4% of the pretraining tokens.

Plain English Explanation

The paper discusses a new way to prune large language models, which are powerful AI systems trained on massive amounts of text data. Pruning refers to the process of selectively removing parts of the model to make it smaller and more efficient, without significantly impacting its performance.

The key idea behind the proposed Adaptive Sparse Trainer (AST) is to retrain the model into a semi-structured sparse form rather than prune it in a single shot. Semi-structured sparsity sits between unstructured sparsity, which removes individual parameters anywhere in the model, and structured sparsity, which removes entire blocks or groups of parameters: it zeroes a fixed number of weights inside every small group (for example, 2 out of every 4), a regular pattern that hardware can exploit. Instead of deciding up front which weights to drop, AST gradually decays the masked weights and lets the model revise its masks during training, with the aim of keeping accuracy close to that of the dense model.

This is important because large language models, like the ones used in popular AI assistants, can be very computationally intensive and resource-heavy. Pruning them can make them more efficient and easier to deploy on a wider range of devices, from powerful servers to more constrained edge devices.

Technical Explanation

The paper presents Adaptive Sparse Trainer (AST), a retraining pipeline that turns dense large language models into semi-structured sparse ones. As described in the abstract, the key elements of AST include:

Incremental Sparsification: Rather than removing weights in one shot, AST applies decay to the masked weights, so the dense model is transformed into a sparse one gradually.
Adaptive Mask Selection: The masks are not fixed in advance; the model is allowed to adaptively select them throughout the training process.
Distillation: The dense model serves as the teacher, which the authors observe prevents the sparse model from falling into local optima and accelerates convergence.
Extra Parameters: Additional well-initialized parameters further enhance performance with minimal increase in memory footprint.
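
To make the mechanism more concrete, here is a minimal sketch of the idea the paper's abstract describes: semi-structured masks combined with decay applied to masked weights. It is a toy illustration, not the authors' implementation. The 2:4 pattern, the magnitude-based selection rule, the `decay` factor, and the function names `nm_mask`, `decay_masked`, and `sparsify` are all assumptions made for this example, and the gradient update that real training would perform is left out.

```python
from typing import List, Tuple


def nm_mask(weights: List[float], n: int = 2, m: int = 4) -> List[int]:
    """Return a 0/1 mask keeping the n largest-magnitude weights in each group of m.

    Magnitude-based selection is an assumption of this sketch.
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask: List[int] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest |w| in this group; ties go to the lower index.
        keep = sorted(range(m), key=lambda i: -abs(group[i]))[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def decay_masked(weights: List[float], mask: List[int], decay: float = 0.5) -> List[float]:
    """Shrink masked-out weights toward zero; leave kept weights unchanged."""
    return [w if keep else w * (1.0 - decay) for w, keep in zip(weights, mask)]


def sparsify(weights: List[float], steps: int = 20, decay: float = 0.5) -> Tuple[List[float], List[int]]:
    """Alternate mask selection and decay, then apply the final mask.

    A real training loop would run a gradient step between the two calls,
    which is what would allow the selected mask to change over time.
    """
    for _ in range(steps):
        mask = nm_mask(weights)                      # mask is re-selected every step
        weights = decay_masked(weights, mask, decay)
    mask = nm_mask(weights)
    return [w * k for w, k in zip(weights, mask)], mask
```

Calling `sparsify` on a flat list of weights returns the pruned weights and a mask with exactly two kept entries in every group of four. Because this toy has no gradient updates, the mask never changes between steps; it only shows how masked weights fade out gradually instead of being zeroed at once.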

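The abstract reports results on LLaMA2-7B, where AST reduces the zero-shot accuracy gap between the dense model and its semi-structured sparse counterpart to 1.12% across multiple zero-shot tasks while using less than 0.4% of the pretraining tokens. The authors also note that AST can be combined with existing quantization techniques to obtain highly compressed models.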
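
The paper's abstract also mentions distillation with the dense model as the teacher. The abstract does not give the exact loss, so the sketch below uses a common knowledge-distillation formulation as a stand-in: a weighted sum of the KL divergence between teacher and student output distributions and the usual cross-entropy on the true token. The weighting `alpha`, the `temperature`, and all function names are assumptions for illustration only.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    target: int,
    alpha: float = 0.5,        # assumed weighting between the two terms
    temperature: float = 1.0,  # assumed softening temperature
) -> float:
    """Blend a teacher-matching term with cross-entropy on the true label."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kd_term = kl_divergence(teacher, student)             # sparse student imitates dense teacher
    ce_term = -math.log(softmax(student_logits)[target])  # standard next-token loss
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

In this formulation, the distillation term vanishes when the sparse student reproduces the dense teacher's distribution exactly, so the teacher gives the student a dense training signal over every output class rather than only the correct one.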

Critical Analysis

The Adaptive Sparse Trainer (AST) approach is promising, but several limitations and open questions remain:

The optimal trade-off between model size reduction and performance preservation may vary depending on the specific use case and requirements, and further work is needed to determine the best approach for different scenarios.
Unlike one-shot pruning methods, AST requires retraining. Although the abstract puts this at less than 0.4% of the pretraining tokens, the compute and memory cost of retraining with a dense teacher could still be an important consideration in resource-constrained settings.
The headline result reported in the abstract is for LLaMA2-7B, and it would be valuable to see the performance of AST on a broader range of large language models and applications.

Overall, AST represents a promising approach to pruning large language models, but there are still opportunities for further refinement and exploration to address these potential limitations.

Conclusion

This paper introduces Adaptive Sparse Trainer (AST), a retraining-based pruning pipeline that produces semi-structured sparse large language models while keeping their performance close to that of the dense originals. Its key ingredients are decay applied to masked weights, masks that the model selects adaptively during training, distillation from the dense model, and additional well-initialized parameters.

On LLaMA2-7B, the authors report that AST reduces the zero-shot accuracy gap between dense and semi-structured sparse models to 1.12% while using less than 0.4% of the pretraining tokens. This is an important advance, as it could enable the deployment of large language models on a wider range of devices, from powerful servers to more constrained edge devices, particularly when combined with existing quantization techniques.

Open questions remain, such as the best trade-off between model size reduction and performance preservation for different use cases, and the compute and memory overhead of the retraining procedure itself.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Weiyu Huang, Yuezhou Hu, Guohao Jian, Jun Zhu, Jianfei Chen

The tremendous success of Large Language Models (LLMs) across various complex tasks relies heavily on their substantial scale, which raises challenges during model deployment due to their large memory consumption. Recently, numerous studies have attempted to compress LLMs using one-shot pruning methods. However, these methods often experience considerable performance degradation on complex language understanding tasks, calling into question the feasibility of pruning in LLMs. To address this issue, we propose a pruning pipeline for semi-structured sparse models via retraining, termed Adaptive Sparse Trainer (AST). Unlike previous one-shot pruning methods, AST incrementally transforms dense models into sparse ones by applying decay to masked weights while allowing the model to adaptively select masks throughout the training process. Furthermore, we observe that using distillation with a dense model as the teacher can prevent the sparse model from falling into local optima and accelerate convergence. In addition, we incorporate extra well-initialized parameters to further enhance model performance with minimal increase in memory footprint. AST can significantly enhance model performance, approaching the level of dense models. When applied to the LLaMA2-7B model, AST reduces the zero-shot accuracy gap between dense and semi-structured sparse models to 1.12% across multiple zero-shot tasks, utilizing less than 0.4% of the pretraining tokens. Our work demonstrates the feasibility of deploying semi-structured sparse large language models and introduces a novel method for achieving highly compressed models when combined with existing quantization techniques.

8/27/2024

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performance in a wide range of text generation tasks. However, the enormous model sizes have hindered their practical use in real-world applications due to high inference latency. Therefore, improving the efficiency of LLMs through quantization, pruning, and other means has been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need for any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method become even more pronounced when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the available code.

4/24/2024

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz

Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.

5/7/2024

Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism

Guanchen Li, Xiandong Zhao, Lian Liu, Zeping Li, Dong Li, Lu Tian, Jie He, Ashish Sirasao, Emad Barsoum

Pre-trained language models (PLMs) are engineered to be robust in contextual understanding and exhibit outstanding performance in various natural language processing tasks. However, their considerable size incurs significant computational and storage costs. Modern pruning strategies employ one-shot techniques to compress PLMs without the need for retraining on task-specific or otherwise general data; however, these approaches often lead to an indispensable reduction in performance. In this paper, we propose SDS, a Sparse-Dense-Sparse pruning framework to enhance the performance of the pruned PLMs from a weight distribution optimization perspective. We outline the pruning process in three steps. Initially, we prune less critical connections in the model using conventional one-shot pruning methods. Next, we reconstruct a dense model featuring a pruning-friendly weight distribution by reactivating pruned connections with sparse regularization. Finally, we perform a second pruning round, yielding a superior pruned model compared to the initial pruning. Experimental results demonstrate that SDS outperforms the state-of-the-art pruning techniques SparseGPT and Wanda under an identical sparsity configuration. For instance, SDS reduces perplexity by 9.13 on Raw-Wikitext2 and improves accuracy by an average of 2.05% across multiple zero-shot benchmarks for OPT-125M with 2:4 sparsity.

8/21/2024