ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

2406.07831

Published 6/13/2024 by Xiang Meng, Kayhan Behdin, Haoyue Wang, Rahul Mazumder

🛠️

Abstract

The impressive performance of Large Language Models (LLMs) across various natural language processing tasks comes at the cost of vast computational resources and storage requirements. One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression. In this paper, we introduce ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. ALPS substantially outperforms state-of-the-art methods in terms of the pruning objective and perplexity reduction, particularly for highly sparse models. On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.

Create account to get full access

Overview

Large Language Models (LLMs) have demonstrated impressive performance across various natural language processing tasks, but this comes at the cost of high computational and storage requirements.
One-shot pruning techniques offer a way to reduce the size of LLMs by removing redundant weights without the need for retraining.
However, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression.
This paper introduces ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step.

Plain English Explanation

Large Language Models (LLMs) are powerful artificial intelligence systems that can perform a wide range of natural language processing tasks, such as answering questions, generating text, and translating between languages. These models have become increasingly impressive in their capabilities, but they require vast computational resources and storage space to operate.

To address this issue, researchers have developed one-shot pruning techniques, which can remove redundant weights from the model without the need for retraining. This helps to reduce the overall size and resource requirements of the LLM. However, the massive scale of these models often forces current pruning approaches to rely on heuristics, or rules of thumb, instead of more sophisticated optimization techniques. This can lead to suboptimal compression, where the model is not as efficiently pruned as it could be.

The paper introduces a new framework called ALPS, which takes a more advanced, optimization-based approach to pruning LLMs. ALPS uses a technique called operator splitting, along with a preconditioned conjugate gradient-based post-processing step, to identify and remove redundant weights in a more systematic and effective way. This allows ALPS to substantially outperform existing pruning methods, particularly for highly sparse models (models with a large number of weights removed).

For example, when applied to the OPT-30B model with 70% sparsity, ALPS achieved a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to other pruning techniques. This means the pruned model maintained better language understanding and task-solving abilities compared to models pruned using other methods.

Technical Explanation

The ALPS framework uses the operator splitting technique and a preconditioned conjugate gradient-based post-processing step to tackle the pruning problem for large language models.

The operator splitting approach allows ALPS to decompose the pruning problem into simpler subproblems that can be solved more efficiently. The preconditioned conjugate gradient step then refines the pruning decisions, leveraging the structure of the language model to guide the optimization process and ensure high-quality compression.

ALPS incorporates several novel techniques to accelerate and theoretically guarantee the convergence of the pruning optimization. These include:

Vectorization and GPU parallelism to improve computational efficiency
Careful initialization and preconditioning to speed up the optimization process
Theoretical guarantees on the convergence of the pruning algorithm

Through these advances, ALPS is able to substantially outperform state-of-the-art pruning methods, particularly for highly sparse models. For example, on the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing techniques.

Critical Analysis

The paper provides a thorough and technically rigorous approach to the problem of pruning large language models. The ALPS framework represents a significant advance over existing heuristic-based pruning methods, demonstrating the benefits of using more sophisticated optimization techniques.

However, the paper does not address some potential limitations or areas for further research. For instance, the ALPS framework is designed for one-shot pruning, where the entire model is pruned at once. It's unclear how well the approach would scale to iterative or structured pruning schemes, where the model is gradually pruned over multiple steps or with specific patterns.

Additionally, the paper focuses solely on perplexity and zero-shot benchmark performance as the evaluation metrics for the pruned models. While these are important measures, it would be valuable to also consider other factors, such as the impact on downstream task performance, the ability to fine-tune the pruned models, or the tradeoffs between compression and model performance.

Further research could also explore the applicability of the ALPS framework to other types of large neural networks beyond language models, as the core optimization techniques may be more broadly useful.

Conclusion

The ALPS framework introduced in this paper represents a significant advancement in the field of pruning large language models. By leveraging optimization-based techniques, rather than heuristics, ALPS is able to achieve substantially better compression and performance preservation than existing pruning methods, particularly for highly sparse models.

This work highlights the potential for more sophisticated optimization approaches to tackle the computational and storage challenges posed by the ever-growing scale of large language models. As the field of natural language processing continues to advance, techniques like ALPS will be crucial in enabling the deployment of these powerful AI systems in real-world applications with limited resources.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Liang Zhao

The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet, traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes where it surpasses current state-of-the-art methods.

5/27/2024

cs.CL

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at post-training phase} and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

6/18/2024

cs.LG cs.CL stat.ML

💬

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Various Large Language Models~(LLMs) from the Generative Pretrained Transformer(GPT) family have achieved outstanding performances in a wide range of text generation tasks. However, the enormous model sizes have hindered their practical use in real-world applications due to high inference latency. Therefore, improving the efficiencies of LLMs through quantization, pruning, and other means has been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need of any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method exhibit even more when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the available code.

4/24/2024

cs.CL cs.AI

Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang

Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and finegrained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the endto-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B,Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

5/16/2024

cs.CL cs.AI cs.LG