Optimization-based Structural Pruning for Large Language Models without Back-Propagation

2406.10576

Published 6/18/2024 by Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Abstract

Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at post-training phase} and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

Create account to get full access

Overview

The paper presents an optimization-based structural pruning method for large language models without back-propagation.
The proposed approach aims to reduce the model size and inference time of large language models while preserving their performance.
The method involves a two-stage optimization process that first prunes the model's structure and then fine-tunes the remaining weights.

Plain English Explanation

Large language models, such as GPT-3 and BERT, have become increasingly powerful in various natural language processing tasks. However, these models are often very large and computationally intensive, making them challenging to deploy on resource-constrained devices or in low-latency applications.

The researchers in this paper introduce a new pruning technique that can reduce the size and inference time of large language models without significantly impacting their performance. Unlike traditional pruning methods that rely on back-propagation, their approach uses an optimization-based method that directly optimizes the structure of the model.

The key idea is to first identify which parts of the model can be safely removed without degrading performance. This is done through a two-stage optimization process. In the first stage, the method prunes the model's structure by removing unnecessary connections and neurons. In the second stage, the remaining weights are fine-tuned to restore the model's performance.

By using this optimization-based approach, the researchers were able to achieve significant model size and inference time reductions, while maintaining the model's accuracy on various language tasks. This could make it easier to deploy large language models in real-world applications, where computational resources and latency are important considerations.

Technical Explanation

The paper presents an optimization-based structural pruning method for large language models that does not require back-propagation. The proposed approach involves a two-stage optimization process:

Structure Pruning: In the first stage, the method optimizes the structure of the model by removing unnecessary connections and neurons. This is done by formulating the pruning problem as a bilevel optimization problem, where the upper-level optimization aims to minimize the model size while the lower-level optimization ensures that the model's performance is maintained.
Weight Fine-tuning: After pruning the model's structure, the remaining weights are fine-tuned to restore the model's performance. This is done by solving a convex optimization problem that minimizes the difference between the original and pruned model's outputs.

The researchers evaluated their approach on various large language models, including GPT-2 and BERT, and demonstrated significant reductions in model size and inference time without significant performance degradation. For example, they were able to prune a GPT-2 model by 50% while maintaining 98% of its original performance.

The key advantage of this optimization-based approach is that it does not require back-propagation, which can be computationally expensive and potentially unstable for large language models. Instead, the method directly optimizes the model's structure, making it more efficient and scalable.

Critical Analysis

The proposed optimization-based structural pruning method is a promising approach for reducing the size and inference time of large language models. The researchers have demonstrated its effectiveness on several popular models, and the lack of reliance on back-propagation is a notable advantage.

However, the paper does not address several potential limitations and areas for further research:

Generalization to Diverse Architectures: The experiments in the paper focus on transformer-based models like GPT-2 and BERT. It's unclear how well the method would generalize to other language model architectures, such as recurrent neural networks or convolutional models.
Scalability to Extremely Large Models: The experiments were conducted on relatively large, but not the largest, language models available. It's uncertain whether the optimization-based approach would scale well to the latest, extremely large models like GPT-3 or PaLM.
Transferability to Downstream Tasks: While the paper shows that the pruned models maintain their performance on the original language tasks, it does not investigate how the pruned models might perform on downstream tasks, such as question answering or text generation. This is an important area for further research.
Interpretability of Pruning Decisions: The paper does not provide much insight into which parts of the model are being pruned and why. A more interpretable pruning method could be beneficial for understanding the model's inner workings and potentially improving its performance.

Despite these limitations, the optimization-based structural pruning method presented in this paper is a valuable contribution to the field of efficient large language model deployment. Further research addressing the identified limitations could lead to even more impactful advancements in this area.

Conclusion

The paper introduces an optimization-based structural pruning method for reducing the size and inference time of large language models without back-propagation. The proposed approach involves a two-stage optimization process that first prunes the model's structure and then fine-tunes the remaining weights.

The researchers demonstrated the effectiveness of their method on popular language models like GPT-2 and BERT, achieving significant reductions in model size and inference time while maintaining the models' performance. This could make it easier to deploy large language models in real-world applications with limited computational resources or low-latency requirements.

While the paper has some limitations, such as the need to investigate the method's scalability and transferability to diverse architectures and downstream tasks, the optimization-based structural pruning technique presented in this work is a valuable contribution to the field of efficient large language model deployment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang

Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and finegrained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the endto-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B,Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

5/16/2024

cs.CL cs.AI cs.LG

💬

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs

4/12/2024

cs.CL cs.AI cs.LG

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

As their size increases, Large Languages Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent method involving intensive weight update. Code is available at https://github.com/locuslab/wanda.

5/7/2024

cs.CL cs.AI cs.LG

💬

Structural Pruning of Pre-trained Language Models via Neural Architecture Search

Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau

Pre-trained language models (PLM), for example BERT or RoBERTa, mark the state-of-the-art for natural language understanding task when fine-tuned on labeled data. However, their large size poses challenges in deploying them for inference in real-world applications, due to significant GPU memory requirements and high inference latency. This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade-off efficiency, for example in terms of model size or latency, and generalization performance. We also show how we can utilize more recently developed two-stage weight-sharing NAS approaches in this setting to accelerate the search process. Unlike traditional pruning methods with fixed thresholds, we propose to adopt a multi-objective approach that identifies the Pareto optimal set of sub-networks, allowing for a more flexible and automated compression process.

5/6/2024

cs.LG cs.CL