MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations

arXiv:2406.07017

Published 6/12/2024 by Zixiao Wang, Jingwei Zhang, Wenqian Zhao, Farzan Farnia, Bei Yu

Abstract

Few-shot gradient methods have been extensively utilized in existing model pruning methods, where the model weights are regarded as static values and the effects of potential weight perturbations are not considered. However, the widely used large language models (LLMs) have several billion model parameters, which could increase the fragility of few-shot gradient pruning. In this work, we experimentally show that one-shot gradient pruning algorithms could lead to unstable results under perturbations to model weights. And the minor error of switching between data formats bfloat16 and float16 could result in drastically different outcomes. To address such instabilities, we leverage optimization analysis and propose an LLM structural pruning method, called MoreauPruner, with provable robustness against weight perturbations. In MoreauPruner, the model weight importance is estimated based on the neural network's Moreau envelope, which can be flexibly combined with $\ell_1$-norm regularization techniques to induce the sparsity required in the pruning task. We extensively evaluate the MoreauPruner algorithm on several well-known LLMs, including LLaMA-7B, LLaMA-13B, LLaMA3-8B, and Vicuna-7B. Our numerical results suggest the robustness of MoreauPruner against weight perturbations, and indicate the MoreauPruner's successful accuracy-based scores in comparison to several existing pruning methods. We have released the code at https://github.com/ShiningSord/MoreauPruner.

Overview

The paper introduces a new pruning method called "MoreauPruner" that aims to make large language models more robust against weight perturbations.
Pruning is the process of removing less important weights from a neural network to reduce its size and improve efficiency without significantly impacting performance.
MoreauPruner is designed to be more effective than existing pruning methods at maintaining model performance even when the remaining weights are perturbed.

Plain English Explanation

MoreauPruner is a new way to "prune", that is, remove the less important parts of, large language models such as LLaMA and Vicuna. Pruning can make these models smaller and more efficient, but the paper shows that common gradient-based pruning methods are fragile: small changes, or "perturbations", to the model weights can lead to very different pruning results. MoreauPruner addresses this by scoring weight importance in a way that stays stable when the weights are slightly perturbed.

The key idea is to use a concept called the "Moreau envelope" to estimate which weights are most important. This lets MoreauPruner decide what to remove in a way that does not change drastically when the weights change slightly. This matters because in practice the weights of a language model may be slightly altered, for example when they are converted between the bfloat16 and float16 data formats, and the abstract reports that even this minor error can drastically change the outcome of existing gradient-based pruning. MoreauPruner aims to produce pruning results that hold up under such perturbations.
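To give a feel for how small the bfloat16/float16 error mentioned in the abstract is, the sketch below rounds one value to each format using only the Python standard library. The helper names `to_float16` and `to_bfloat16` and the sample weight are illustrative assumptions; this is not the perturbation procedure used in the paper.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Round x to bfloat16 (8 exponent bits, 7 mantissa bits).

    bfloat16 is the top 16 bits of a float32, so we round the float32 bit
    pattern to nearest-even on the 16 low bits and then clear them.
    Assumes a finite, ordinary-sized input (no NaN/inf handling).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the 16 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical weight value, chosen only for illustration.
weight = 0.1234567
as_fp16 = to_float16(weight)
as_bf16 = to_bfloat16(weight)

print("float16 :", as_fp16)
print("bfloat16:", as_bf16)
print("relative gap:", abs(as_fp16 - as_bf16) / abs(weight))
```

The two formats place the same value on slightly different grid points, so a model stored in one format and a model stored in the other are small perturbations of each other. The abstract's claim is that one-shot gradient pruning can react strongly to differences of this size.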

Technical Explanation

The paper proposes a structural pruning method called MoreauPruner that estimates weight importance from the Moreau envelope of the neural network. The Moreau envelope is a smoothed version of a function, so an importance estimate built on it is expected to change less under small weight perturbations than one built on the raw gradient at a single point in weight space.
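For reference, the standard textbook definition of the Moreau envelope is written out below, together with the soft-thresholding operator associated with the $\ell_1$ norm mentioned in the abstract. The symbols $g$, $w$, $y$, $\lambda$, and $\eta$ are notation chosen here and may differ from the paper's; how exactly the paper combines the two pieces is not stated in this summary.

```latex
% Moreau envelope and proximal operator of g with smoothing parameter \lambda > 0
M_g^{\lambda}(w) = \min_{y} \Big\{ g(y) + \frac{1}{2\lambda}\,\lVert y - w \rVert_2^2 \Big\},
\qquad
\operatorname{prox}_{\lambda g}(w) = \arg\min_{y} \Big\{ g(y) + \frac{1}{2\lambda}\,\lVert y - w \rVert_2^2 \Big\}.

% For convex g, the envelope is differentiable and its gradient is
\nabla M_g^{\lambda}(w) = \frac{1}{\lambda}\,\big( w - \operatorname{prox}_{\lambda g}(w) \big).

% Proximal operator of the scaled l1 norm (soft-thresholding), applied per coordinate
\big[ \operatorname{prox}_{\eta \lVert \cdot \rVert_1}(v) \big]_i
  = \operatorname{sign}(v_i)\,\max\big( \lvert v_i \rvert - \eta,\; 0 \big).
```

Neural network losses are not convex, so the gradient identity should be read as the motivating case rather than a guarantee for an LLM.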

According to the abstract, MoreauPruner estimates importance from the network's Moreau envelope instead of treating the weights as static values, and this estimate can be combined with $\ell_1$-norm regularization to induce the sparsity that pruning requires. The parameters are then ranked by estimated importance and the lowest-ranked ones are pruned until the desired model size is reached.
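One way to picture a Moreau-envelope-based importance estimate is the minimal sketch below. It approximates the gradient of the Moreau envelope by running plain gradient descent on the inner proximal problem, then forms a first-order score `|w_i * g_i|`. The function names, the toy quadratic loss, the hyperparameters, and the score formula are all assumptions made for illustration; the released MoreauPruner code may differ in every one of these details.

```python
from typing import Callable, List

Vector = List[float]


def moreau_envelope_grad(
    grad_f: Callable[[Vector], Vector],
    w: Vector,
    lam: float = 0.1,
    steps: int = 200,
    lr: float = 0.05,
) -> Vector:
    """Approximate the gradient of the Moreau envelope of f at w.

    Runs gradient descent on y -> f(y) + ||y - w||^2 / (2 * lam) to find
    an approximate proximal point y*, then returns (w - y*) / lam.
    """
    y = list(w)
    for _ in range(steps):
        g = grad_f(y)
        y = [yi - lr * (gi + (yi - wi) / lam) for yi, gi, wi in zip(y, g, w)]
    return [(wi - yi) / lam for wi, yi in zip(w, y)]


# Toy loss f(y) = 0.5 * sum(a_i * y_i^2), used only because its proximal
# point has a closed form that the approximation can be checked against.
CURVATURE = [1.0, 4.0]


def toy_grad(y: Vector) -> Vector:
    return [a * yi for a, yi in zip(CURVATURE, y)]


def importance(w: Vector, g: Vector) -> Vector:
    """Hypothetical first-order importance score |w_i * g_i|."""
    return [abs(wi * gi) for wi, gi in zip(w, g)]


weights = [2.0, -1.0]
plain = toy_grad(weights)
smoothed = moreau_envelope_grad(toy_grad, weights)

print("plain gradient score :", importance(weights, plain))
print("Moreau gradient score:", importance(weights, smoothed))
```

On this toy loss the result can be checked against the closed form `a_i * w_i / (1 + lam * a_i)`. The smoothing shrinks the gradient most along high-curvature directions, which are the directions where the raw gradient changes fastest under a weight perturbation.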

The authors evaluate MoreauPruner on several large language models, including LLaMA-7B, LLaMA-13B, LLaMA3-8B, and Vicuna-7B. Their results suggest that it is robust to weight perturbations and that it achieves competitive accuracy-based scores compared with several existing pruning methods. This robustness is an important property, as it can help ensure the reliability of pruned models in real-world deployment scenarios.

Critical Analysis

The paper presents a well-designed and thorough evaluation of the MoreauPruner method, considering various perturbation scenarios and comparing its performance to existing pruning approaches. Related work in this area includes Wanda ("A Simple and Effective Pruning Approach for Large Language Models") and adaptive estimation fusion, both listed under Related Papers.

One potential limitation of the MoreauPruner method is its computational overhead, as the Moreau envelope calculations can be resource-intensive, especially for very large language models. The authors mention this and suggest that future work could explore ways to reduce the computational cost, such as approximating the Moreau envelope or using a more efficient implementation.

Another area for further research could be investigating the impact of MoreauPruner on the linguistic and reasoning capabilities of the pruned models, as pruning can potentially affect these properties in subtle ways. Sheared LLaMA is an example of research exploring the effects of pruning on model capabilities.

Overall, the MoreauPruner method presents a promising approach to making large language models more robust to weight perturbations, which is an important consideration for real-world deployments. The authors have made a valuable contribution to the ongoing research on effective pruning of large language models.

Conclusion

The MoreauPruner method introduced in this paper offers a new way to prune large language models that is more robust to weight perturbations than existing pruning techniques. By using the Moreau envelope to estimate which weights are most important, MoreauPruner can produce stable pruning results even when the model weights are slightly altered.

This improved robustness is a valuable property, as it can help ensure the reliability and consistency of pruned language models when deployed in real-world applications. While the computational overhead of the Moreau envelope calculations is a potential limitation, the authors have demonstrated the effectiveness of MoreauPruner on several large language models, making it a promising contribution to the ongoing research on efficient and reliable model pruning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at post-training phase and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

6/18/2024

cs.LG cs.CL stat.ML

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, Zhiqiang Shen

Large Language Models (LLMs) with billions of parameters are prime targets for network pruning, removing some model weights without hurting performance. Prior approaches such as magnitude pruning, SparseGPT, and Wanda, either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained LLMs. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the pruning metric, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguingly, by incorporating gradients, unstructured pruning with our method tends to reveal some structural patterns, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates to maintain its simplicity as other counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various benchmarks show that GBLM-Pruner surpasses magnitude pruning, Wanda and SparseGPT by significant margins. We further extend our approach on Vision Transformer. Our code and models are available at https://github.com/VILA-Lab/GBLM-Pruner.

4/10/2024

cs.CL cs.AI cs.LG

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods involving intensive weight update. Code is available at https://github.com/locuslab/wanda.

5/7/2024

cs.CL cs.AI cs.LG

Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang

Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

5/16/2024

cs.CL cs.AI cs.LG