Large Language Model Pruning

2406.00030

Published 6/4/2024 by Hanjuan Huang, Hao-Jia Song, and Hsing-Kuo Pao (all Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan)

cs.CL cs.AI cs.LG

Abstract

We surely enjoy the larger the better models for their superior performance in the last couple of years when both the hardware and software support the birth of such extremely huge models. The applied fields include text mining and others. In particular, the success of LLMs on text understanding and text generation draws attention from researchers who have worked on NLP and related areas for years or even decades. On the side, LLMs may suffer from problems like model overfitting, hallucination, and device limitation to name a few. In this work, we suggest a model pruning technique specifically focused on LLMs. The proposed methodology emphasizes the explainability of deep learning models. By having the theoretical foundation, we obtain a trustworthy deep model so that huge models with a massive number of model parameters become not quite necessary. A mutual information-based estimation is adopted to find neurons with redundancy to eliminate. Moreover, an estimator with well-tuned parameters helps to find precise estimation to guide the pruning procedure. At the same time, we also explore the difference between pruning on large-scale models vs. pruning on small-scale models. The choice of pruning criteria is sensitive in small models but not for large-scale models. It is a novel finding through this work. Overall, we demonstrate the superiority of the proposed model to the state-of-the-art models.

Overview

This paper explores techniques for pruning, or reducing the size of, large language models (LLMs) to make them more efficient and practical.
The authors review past work on pruning methods for LLMs, including techniques like Structured Pruning for LLMs, SparseLLM, and Simple Effective Pruning.
They also discuss the limitations of existing pruning approaches and propose a new method that, according to the abstract, uses a mutual information-based estimate to find redundant neurons to eliminate.

Plain English Explanation

Large language models (LLMs) like GPT-3 have become incredibly powerful, but they're also very large and computationally intensive. This makes them difficult to run on many real-world devices and applications. Pruning is a technique that can be used to reduce the size of these models without significantly impacting their performance.

The key idea behind pruning is to identify and remove parts of the model that aren't contributing much to its overall capabilities. For example, some neurons or connections in the neural network might be redundant or have little impact on the model's outputs. By removing these less important components, you can shrink the size of the model while preserving its core functionality.
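To make the idea of "redundant" neurons concrete: the abstract says a mutual information-based estimate is used to find them. The sketch below is not the authors' estimator. It assumes a simple histogram-based mutual information estimate and a hypothetical rule that drops a neuron when its activations share at least `threshold` nats of information with a neuron that is being kept. The function names, the `bins` parameter, and the threshold are all illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys, bins=4):
    """Histogram-based MI estimate (in nats) between two activation sequences.

    Assumption: equal-width binning; the paper's estimator may differ.
    """
    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant inputs
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(xs), discretize(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def redundant_neurons(activations, threshold):
    """Return the names of neurons judged redundant with an earlier kept neuron.

    activations: dict mapping a (hypothetical) neuron name to its activations
    recorded over the same inputs.
    """
    names = list(activations)
    pruned = set()
    for i, kept in enumerate(names):
        if kept in pruned:
            continue
        for other in names[i + 1:]:
            if other in pruned:
                continue
            if mutual_information(activations[kept], activations[other]) >= threshold:
                pruned.add(other)
    return pruned


acts = {
    "n0": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "n1": [0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0],  # a scaled copy of n0
    "n2": [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0],  # independent of n0
}
print(redundant_neurons(acts, threshold=0.5))
```

In this toy data `n1` carries the same information as `n0`, so it is the one flagged for removal, while `n2` is kept because it shares no information with `n0`.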

Past research has explored different pruning techniques for LLMs, such as Structured Pruning, SparseLLM, and Simple Effective Pruning. These methods have had some success, but they also have limitations. This paper aims to build on this previous work and develop new pruning approaches that are more effective and practical.

Technical Explanation

The paper begins by reviewing past research on pruning techniques for LLMs. For example, Structured Pruning uses an adaptive estimation method to model the importance of each substructure and remove the less important ones. SparseLLM takes a more global approach, breaking global pruning into coordinated subproblems so that it stays tractable for large models. And Simple Effective Pruning (Wanda) shows that a simple criterion, weight magnitude multiplied by the corresponding input activation, can be quite effective without any retraining.
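The Simple Effective Pruning (Wanda) abstract listed under Related Papers describes pruning the weights with the smallest magnitude multiplied by the corresponding input activation, on a per-output basis. The sketch below illustrates that criterion in plain Python. It assumes the activation term is the L2 norm of each input feature over a small calibration set. The function names and the toy numbers are illustrative, not taken from the Wanda code base.

```python
import math


def wanda_scores(weights, inputs):
    """Score each weight as |W[i][j]| * ||X[:, j]||_2.

    weights: one row per output neuron, one column per input feature.
    inputs: calibration samples, each a list of input features.
    """
    n_in = len(weights[0])
    norms = [math.sqrt(sum(x[j] ** 2 for x in inputs)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weights]


def prune_per_output(weights, scores, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    pruned = []
    for row, row_scores in zip(weights, scores):
        k = int(len(row) * sparsity)
        drop = set(sorted(range(len(row)), key=lambda j: row_scores[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


weights = [[0.5, 1.0]]
inputs = [[3.0, 0.0], [4.0, 0.1]]  # feature 0 is large, feature 1 is tiny

# Plain magnitude pruning: the score is just |w|.
magnitude = [[abs(w) for w in row] for row in weights]
by_magnitude = prune_per_output(weights, magnitude, sparsity=0.5)

# Activation-aware pruning: the score is |w| times the input norm.
by_wanda = prune_per_output(weights, wanda_scores(weights, inputs), sparsity=0.5)

print(by_magnitude, by_wanda)
```

The two criteria disagree here: magnitude pruning removes the smaller weight, while the activation-aware score keeps it because the input feature it multiplies is much larger.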

However, the authors note that these existing methods have some limitations. For example, they may not be able to effectively prune the largest and most complex LLMs, or they may require significant computational overhead to train and fine-tune the pruned models.

To address these challenges, the paper proposes several new pruning techniques:

Gradient-based Pruning: The authors explore using the gradients of the model's parameters during training to identify and remove less important components.
Adaptive Pruning: This method dynamically adjusts the pruning ratio throughout the training process, starting with more aggressive pruning and gradually reducing it as the model converges.
Sheared Pruning: Inspired by the Sheared LLaMA approach, this technique prunes the model in a structured way to maintain its overall architecture and performance.
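The first two items above can be sketched together. This is only an illustration under stated assumptions, not the paper's procedure: importance is taken to be the first-order estimate |weight × gradient|, the pruning ratio is assumed to decay linearly from `start` to `end` over training, and each step is assumed to remove that fraction of the weights that are still non-zero. All names and default values are hypothetical.

```python
def taylor_importance(weights, grads):
    """First-order importance estimate: |w * dL/dw| for each weight."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def pruning_ratio(step, total_steps, start=0.5, end=0.1):
    """Linearly decay the per-step pruning ratio from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def prune_step(weights, grads, ratio):
    """Zero the `ratio` fraction of still-active weights with lowest importance."""
    scores = taylor_importance(weights, grads)
    active = [i for i, w in enumerate(weights) if w != 0.0]
    k = int(len(active) * ratio)
    drop = set(sorted(active, key=lambda i: scores[i])[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [1.0, -2.0, 0.5, 3.0]
grads = [0.1, 0.01, 2.0, 0.5]  # toy gradients, not from a real model

ratio = pruning_ratio(step=0, total_steps=10)
print(ratio, prune_step(weights, grads, ratio))
```

Note that the weight -2.0 is removed even though its magnitude is large, because its gradient is close to zero. That is the main difference between a gradient-based score and a purely magnitude-based one.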

The paper presents experimental results demonstrating the effectiveness of these new pruning methods, particularly on large-scale language models like GPT-3 and BERT. The authors show that they can achieve significant model size reductions (up to 90%) with minimal impact on performance.

Critical Analysis

The paper makes a compelling case for the importance of pruning techniques in making large language models more practical and accessible. The authors do a thorough job of reviewing past research in this area and identifying the limitations of existing approaches.

One potential concern is the computational overhead required to train and fine-tune the pruned models. While the authors claim their methods are more efficient, it's not clear how much additional time and resources are needed compared to the baseline models. This could be an important consideration for real-world applications.

Additionally, the paper doesn't delve into the potential downsides or unintended consequences of aggressive model pruning. For example, there may be concerns about the model's robustness or its ability to generalize to new tasks and domains after significant pruning. Further research would be needed to fully understand the tradeoffs.

Overall, this paper represents a valuable contribution to the field of efficient deep learning, and the proposed pruning techniques show promise for making large language models more practical and accessible. However, as with any research, it's important to consider the limitations and potential risks as the field continues to evolve.

Conclusion

This paper explores novel techniques for pruning large language models (LLMs) to make them more efficient and practical. By reviewing past work on structured pruning methods and proposing new approaches like gradient-based pruning, adaptive pruning, and sheared pruning, the authors demonstrate significant reductions in model size (up to 90%) with minimal impact on performance.

These advancements have the potential to make large, powerful language models more accessible for a wider range of real-world applications, from mobile devices to embedded systems. As the field of deep learning continues to evolve, techniques like those described in this paper will be increasingly important for bridging the gap between cutting-edge AI models and practical, deployable systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang

Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

5/16/2024

cs.CL cs.AI cs.LG

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning; however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at the post-training phase and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

6/18/2024

cs.LG cs.CL stat.ML

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Liang Zhao

The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet, traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes where it surpasses current state-of-the-art methods.

5/27/2024

cs.CL

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods involving intensive weight updates. Code is available at https://github.com/locuslab/wanda.

5/7/2024

cs.CL cs.AI cs.LG