Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

2311.04902

Published 4/10/2024 by Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, Zhiqiang Shen

💬

Abstract

Large Language Models (LLMs) with billions of parameters are prime targets for network pruning, removing some model weights without hurting performance. Prior approaches such as magnitude pruning, SparseGPT, and Wanda, either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained LLMs. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the pruning metric, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguingly, by incorporating gradients, unstructured pruning with our method tends to reveal some structural patterns, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates to maintain its simplicity as other counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various benchmarks show that GBLM-Pruner surpasses magnitude pruning, Wanda and SparseGPT by significant margins. We further extend our approach on Vision Transformer. Our code and models are available at https://github.com/VILA-Lab/GBLM-Pruner.

Create account to get full access

Overview

Researchers present a novel sparsity-centric pruning method called GBLM-Pruner for pretrained large language models (LLMs) with billions of parameters.
Prior approaches focused on pruning weights or integrating weights with activations, but overlooked the informative gradients from pretrained LLMs.
GBLM-Pruner leverages the first-order term of the Taylor expansion to determine the pruning metric, operating in a training-free manner using properly normalized gradients from a few calibration samples.
GBLM-Pruner substantially outperforms competitive methods like SparseGPT and Wanda across multiple benchmarks.
Incorporating gradients leads to unstructured pruning that reveals structural patterns, mirroring the geometric interdependence in LLM parameters.
GBLM-Pruner is simple, as it functions without any subsequent retraining or weight updates.

Plain English Explanation

Large language models (LLMs) with billions of parameters are powerful, but also very storage-intensive and computationally expensive. Researchers have been exploring ways to "prune" these models, which means removing some of the model weights (parameters) without significantly hurting their performance.

Prior pruning methods have focused on either just the model weights or a combination of weights and activations (the outputs of each layer). However, these approaches have overlooked an important aspect of LLMs: the informative gradients, which are the small changes in the model's internal parameters that help it learn and improve.

The researchers in this paper have developed a new pruning method called GBLM-Pruner that specifically leverages these gradients. It does this in a "training-free" way, meaning it can prune the model without having to retrain it from scratch. GBLM-Pruner uses a few calibration samples to get a sense of the model's gradients, and then uses this information to determine which weights are the least important and can be safely removed.

Importantly, GBLM-Pruner significantly outperforms other state-of-the-art pruning methods, like SparseGPT and Wanda, across a variety of benchmarks. And by incorporating the gradients, GBLM-Pruner's unstructured pruning (removing individual weights rather than entire model components) actually reveals some interesting structural patterns in the LLM's parameters, suggesting there is an inherent geometric interdependence in how these massive models are structured.

Overall, GBLM-Pruner provides a powerful and efficient way to compress large language models without significantly impacting their performance, which could have important implications for deploying these models in real-world applications.

Technical Explanation

The researchers present a novel sparsity-centric pruning method called Gradient-based Language Model Pruner (GBLM-Pruner) for pretrained large language models (LLMs). Prior approaches, such as magnitude pruning, SparseGPT, and Wanda, have focused either solely on the model weights or integrated weights with activations for sparsity, but overlooked the informative gradients derived from the pretrained LLMs.

GBLM-Pruner leverages the first-order term of the Taylor expansion to determine the pruning metric in a training-free manner. It harnesses properly normalized gradients from a few calibration samples to identify the least important weights to prune. This approach substantially outperforms competitive methods like SparseGPT and Wanda across multiple benchmarks.

Interestingly, by incorporating gradients, the unstructured pruning performed by GBLM-Pruner (removing individual weights rather than entire model components) tends to reveal some structural patterns. This suggests an inherent geometric interdependence in the parameter structure of these large language models.

Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates, maintaining its simplicity compared to other counterparts that require more complex fine-tuning or retraining steps.

The researchers extensively evaluate GBLM-Pruner on the LLaMA-1 and LLaMA-2 language models, as well as extending the approach to the Vision Transformer. The results show that GBLM-Pruner surpasses magnitude pruning, Wanda, and SparseGPT by significant margins across various benchmarks.

Critical Analysis

The researchers provide a comprehensive evaluation of GBLM-Pruner and demonstrate its superiority over other state-of-the-art pruning methods. However, the paper does not delve into potential limitations or areas for further research.

One aspect that could be explored further is the specific structural patterns that emerge from GBLM-Pruner's unstructured pruning. While the researchers note that these patterns mirror the geometric interdependence in the LLM's parameter structure, a deeper analysis of these patterns and their implications could yield additional insights.

Additionally, the paper focuses on evaluating GBLM-Pruner on language models and the Vision Transformer. It would be interesting to see how the method performs on other types of large neural networks, such as those used in computer vision or other domains, to assess its broader applicability.

Finally, the researchers could have discussed potential real-world implications and use cases for GBLM-Pruner, beyond just the academic significance of the method. Exploring how this pruning technique could enable more efficient deployment of large language models in practical applications would further strengthen the paper's impact.

Overall, the GBLM-Pruner approach represents a significant advancement in the field of model compression and could have meaningful implications for the development and deployment of large, resource-intensive neural networks.

Conclusion

The researchers have presented GBLM-Pruner, a novel sparsity-centric pruning method for pretrained large language models (LLMs) that leverages the informative gradients derived from these models. By harnessing properly normalized gradients from a few calibration samples, GBLM-Pruner is able to determine the pruning metric in a training-free manner and significantly outperform other state-of-the-art pruning techniques.

Notably, the incorporation of gradients in GBLM-Pruner's unstructured pruning process reveals interesting structural patterns, suggesting an inherent geometric interdependence in the parameter structure of LLMs. This finding could lead to further insights into the internal workings and representations of these powerful models.

The simplicity and effectiveness of GBLM-Pruner, along with its strong performance across multiple benchmarks, make it a promising tool for enabling more efficient deployment of large language models in real-world applications. As the field of AI continues to push the boundaries of model size and complexity, innovations like GBLM-Pruner will be crucial for balancing model performance and resource constraints.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

As their size increases, Large Languages Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent method involving intensive weight update. Code is available at https://github.com/locuslab/wanda.

5/7/2024

cs.CL cs.AI cs.LG

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Liang Zhao

The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet, traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes where it surpasses current state-of-the-art methods.

5/27/2024

cs.CL

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at post-training phase} and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

6/18/2024

cs.LG cs.CL stat.ML

MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations

Zixiao Wang, Jingwei Zhang, Wenqian Zhao, Farzan Farnia, Bei Yu

Few-shot gradient methods have been extensively utilized in existing model pruning methods, where the model weights are regarded as static values and the effects of potential weight perturbations are not considered. However, the widely used large language models (LLMs) have several billion model parameters, which could increase the fragility of few-shot gradient pruning. In this work, we experimentally show that one-shot gradient pruning algorithms could lead to unstable results under perturbations to model weights. And the minor error of switching between data formats bfloat16 and float16 could result in drastically different outcomes. To address such instabilities, we leverage optimization analysis and propose an LLM structural pruning method, called MoreauPruner, with provable robustness against weight perturbations. In MoreauPruner, the model weight importance is estimated based on the neural network's Moreau envelope, which can be flexibly combined with $ell_1$-norm regularization techniques to induce the sparsity required in the pruning task. We extensively evaluate the MoreauPruner algorithm on several well-known LLMs, including LLaMA-7B, LLaMA-13B, LLaMA3-8B, and Vicuna-7B. Our numerical results suggest the robustness of MoreauPruner against weight perturbations, and indicate the MoreauPruner's successful accuracy-based scores in comparison to several existing pruning methods. We have released the code in url{https://github.com/ShiningSord/MoreauPruner}.

6/12/2024

cs.LG cs.CL