Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models

Read original: arXiv:2406.02924 - Published 6/6/2024 by Peijie Dong, Lujun Li, Zhenheng Tang, Xiang Liu, Xinglin Pan, Qiang Wang, Xiaowen Chu

Overview

The paper proposes "Pruner-Zero", a framework that automatically searches for symbolic pruning metrics for large language models (LLMs) instead of relying on hand-designed ones.
The goal is to compress LLMs after training, without retraining and without significant performance degradation.
The researchers use symbolic regression, carried out with genetic programming, to evolve a compact, interpretable pruning metric, avoiding the expert involvement and trial and error that existing metrics require.

Plain English Explanation

The paper introduces a technique called "Pruner-Zero" that aims to make large language models (LLMs) more efficient by selectively removing unnecessary parts of the model. LLMs are powerful AI systems that can understand and generate human-like text, but they can also be very large and computationally intensive.

The key idea behind Pruner-Zero is to automatically discover a simple, mathematical formula (called a "symbolic pruning metric") that can quickly identify which parts of the LLM are the least important and can be safely removed without significantly affecting the model's performance. This is done using a process called "symbolic regression," where a computer algorithm iteratively tries different mathematical expressions to find the one that best matches the data.
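To make "a formula that scores each weight" concrete, here is a minimal sketch in Python with NumPy. It uses two common hand-designed scores, weight magnitude and weight magnitude scaled by input activation norm, purely as illustrations. Neither is the metric Pruner-Zero discovers, and the matrix shapes, the calibration data `X`, and the helper name `prune_by_metric` are assumptions made for this example.

```python
import numpy as np

def prune_by_metric(weights, scores, sparsity):
    """Zero out the fraction `sparsity` of weights with the lowest scores."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy()
    # Flat indices of the k lowest-scoring weights.
    lowest = np.argpartition(scores.ravel(), k - 1)[:k]
    mask = np.ones(weights.size, dtype=bool)
    mask[lowest] = False
    return weights * mask.reshape(weights.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))    # toy weight matrix (out_features x in_features)
X = rng.normal(size=(32, 6))   # toy calibration activations (samples x in_features)

# Two hand-designed metrics, written as formulas over W and X (illustrative only).
magnitude_scores = np.abs(W)                               # |W_ij|
activation_scores = np.abs(W) * np.linalg.norm(X, axis=0)  # |W_ij| * ||X_j||_2

pruned = prune_by_metric(W, activation_scores, sparsity=0.5)
print(np.count_nonzero(pruned), "of", W.size, "weights kept")
```

Swapping `activation_scores` for `magnitude_scores` changes which weights survive. Choosing that formula well is the problem the paper automates.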

By using this data-driven approach, the researchers hope to avoid the need for manual feature engineering or complex pruning algorithms, making the process of compressing LLMs more efficient and accessible. The goal is to create smaller, more efficient LLMs that can be deployed on a wider range of devices and applications, while still maintaining their impressive language capabilities.

Technical Explanation

Pruner-Zero treats the design of a pruning metric as a search problem. Recent post-training pruning methods score weights with hand-designed metrics, which lets them prune LLMs without retraining but requires human experts and tedious trial and error to design. Pruner-Zero instead searches for symbolic pruning metrics automatically using genetic programming, over a search space built to encompass the existing metrics.
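One way to picture a symbolic metric is as a small expression tree over per-weight quantities. The sketch below is an assumption-laden illustration, not the paper's implementation. The operator sets, the input names `W` (weights) and `G` (gradients), and the tuple encoding are all hypothetical. The paper's abstract also mentions an "opposing operation simplification" strategy without describing it here, so `simplify` is only a guess at the general idea: collapsing an operation applied directly to the operation it undoes.

```python
import numpy as np

# Hypothetical operator sets; the paper's actual search space is not listed in this summary.
UNARY = {"abs": np.abs, "neg": np.negative, "exp": np.exp, "log": np.log, "sqr": np.square}
BINARY = {"add": np.add, "mul": np.multiply}

def evaluate(expr, inputs):
    """Evaluate an expression tree element-wise; leaves are names of input arrays."""
    if isinstance(expr, str):
        return inputs[expr]
    op, *args = expr
    values = [evaluate(arg, inputs) for arg in args]
    table = UNARY if len(values) == 1 else BINARY
    return table[op](*values)

# Outer operation -> inner operation that it exactly undoes (an assumed, tiny rule set).
UNDOES = {"log": "exp", "neg": "neg"}

def simplify(expr):
    """Remove pairs such as log(exp(x)) or neg(neg(x)), working bottom-up."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [simplify(arg) for arg in args]
    inner = args[0]
    if len(args) == 1 and not isinstance(inner, str) and UNDOES.get(op) == inner[0]:
        return inner[1]  # the argument of the cancelled inner operation
    return (op, *args)

rng = np.random.default_rng(0)
inputs = {"W": rng.normal(size=(3, 4)), "G": rng.normal(size=(3, 4))}

metric = ("mul", ("abs", "W"), ("log", ("exp", ("abs", "G"))))
simpler = simplify(metric)
print(simpler)  # ('mul', ('abs', 'W'), ('abs', 'G'))
scores = evaluate(simpler, inputs)  # one importance score per weight
```

Encoding candidates as trees keeps them readable as formulas and makes them easy to mutate or recombine during a search.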

The Pruner-Zero approach involves three main parts:

Search Space: Candidate pruning metrics are represented as symbolic expressions. The search space is designed to encompass the existing hand-designed pruning metrics, so the search can rediscover them or improve on them.
Genetic Programming Search: A population of candidate metrics is evolved, with candidates judged by how well the model performs after being pruned with them. An "opposing operation simplification" strategy is used to increase the diversity of the population.
Post-training Pruning: The discovered metric scores the weights of an already-trained LLM, and the lowest-scoring weights are removed. Like other post-training pruning methods, this does not require retraining the model.
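The search for a metric can be sketched as a toy evolutionary loop. This is a simplified stand-in under stated assumptions, not the authors' procedure. It uses only selection and subtree mutation (no crossover). It scores candidates by the output error of a single random linear layer after pruning, not by LLM quality. The inputs `W` (weights) and `A` (per-input activation norms), the operator sets, and all hyperparameters are hypothetical.

```python
import random
import numpy as np

data_rng = np.random.default_rng(0)
W = data_rng.normal(size=(8, 16))                                       # toy layer weights
X = data_rng.normal(size=(64, 16)) * data_rng.uniform(0.1, 3.0, size=16)  # toy activations
INPUTS = {"W": W, "A": np.broadcast_to(np.linalg.norm(X, axis=0), W.shape)}
UNARY = {"abs": np.abs, "sqr": np.square}
BINARY = {"add": np.add, "mul": np.multiply}
MAX_DEPTH = 3  # keeps expressions small and numerically tame

def random_expr(py_rng, depth=MAX_DEPTH):
    """Sample a random expression tree of at most `depth` levels of operators."""
    if depth == 0 or py_rng.random() < 0.3:
        return py_rng.choice(sorted(INPUTS))
    if py_rng.random() < 0.5:
        return (py_rng.choice(sorted(UNARY)), random_expr(py_rng, depth - 1))
    return (py_rng.choice(sorted(BINARY)),
            random_expr(py_rng, depth - 1), random_expr(py_rng, depth - 1))

def evaluate(expr):
    if isinstance(expr, str):
        return INPUTS[expr]
    op, *args = expr
    values = [evaluate(arg) for arg in args]
    return (UNARY if len(values) == 1 else BINARY)[op](*values)

def fitness(expr, sparsity=0.5):
    """Lower is better: layer output error after pruning the lowest-scoring weights."""
    scores = evaluate(expr)
    k = int(sparsity * W.size)
    drop = np.argpartition(scores.ravel(), k - 1)[:k]
    mask = np.ones(W.size, dtype=bool)
    mask[drop] = False
    pruned = W * mask.reshape(W.shape)
    return float(np.linalg.norm(X @ W.T - X @ pruned.T))

def mutate(expr, py_rng, depth=MAX_DEPTH):
    """Replace one randomly chosen subtree, never exceeding the depth budget."""
    if isinstance(expr, str) or depth == 0 or py_rng.random() < 0.3:
        return random_expr(py_rng, depth)
    op, *args = expr
    i = py_rng.randrange(len(args))
    args[i] = mutate(args[i], py_rng, depth - 1)
    return (op, *args)

def evolve(generations=10, pop_size=20, seed=0):
    py_rng = random.Random(seed)
    population = [random_expr(py_rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[: pop_size // 2]  # keep the better half unchanged
        children = [mutate(py_rng.choice(survivors), py_rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return min(population, key=fitness)

best = evolve()
print("best metric:", best, "error:", fitness(best))
```

Because the better half of each generation is carried over unchanged, the best error never gets worse from one generation to the next. In a real setting each fitness evaluation would mean pruning and evaluating an LLM, which is where the cost of such a search comes from.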

The researchers evaluate Pruner-Zero on LLaMA and LLaMA-2 models, covering language modeling and zero-shot tasks. They report that the automatically discovered symbolic pruning metrics outperform state-of-the-art post-training pruning methods.

Critical Analysis

The Pruner-Zero approach presents several advantages over traditional pruning methods. By using symbolic regression to discover the pruning metric, the researchers avoid the need for manual feature engineering or complex pruning algorithms, making the process more efficient and accessible. The resulting pruning metric is also interpretable, which can provide insights into the underlying structure and importance of different parts of the LLM.

However, the paper does not address the potential limitations or drawbacks of the Pruner-Zero approach. For example, the symbolic regression process may be computationally intensive, especially for large and complex LLMs. Additionally, the generalization of the discovered pruning metrics to different tasks or LLM architectures is not thoroughly explored.

Furthermore, the paper does not discuss the potential impact of pruning on the fairness, robustness, or safety of the LLMs. Removing a significant portion of the model parameters may inadvertently affect the model's behavior in unpredictable ways, which could have important implications for real-world applications.

Overall, the Pruner-Zero approach is a promising step towards more efficient and interpretable model pruning for large language models. However, further research is needed to address the potential limitations and explore the broader implications of this technique.

Conclusion

The Pruner-Zero paper presents a novel approach for pruning large language models by automatically discovering a symbolic pruning metric through symbolic regression. This data-driven approach avoids the need for manual feature engineering or complex pruning algorithms, making the model compression process more efficient and accessible.

The key contribution of the paper is an automatic framework for discovering compact, interpretable pruning metrics. On LLaMA and LLaMA-2, the discovered metrics outperform state-of-the-art post-training pruning methods on language modeling and zero-shot tasks. This has important implications for deploying large language models on a wider range of devices and applications, where computational resources are limited.

While the Pruner-Zero approach shows promise, further research is needed to address potential limitations, such as the computational complexity of the symbolic regression process and the potential impact of pruning on the fairness, robustness, and safety of the language models. Nonetheless, this work represents an important step towards more efficient and interpretable model compression techniques for large language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models

Peijie Dong, Lujun Li, Zhenheng Tang, Xiang Liu, Xinglin Pan, Qiang Wang, Xiaowen Chu

Despite the remarkable capabilities, Large Language Models (LLMs) face deployment challenges due to their extensive size. Pruning methods drop a subset of weights to accelerate, but many of them require retraining, which is prohibitively expensive and computationally demanding. Recently, post-training pruning approaches introduced novel metrics, enabling the pruning of LLMs without retraining. However, these metrics require the involvement of human experts and tedious trial and error. To efficiently identify superior pruning metrics, we develop an automatic framework for searching symbolic pruning metrics using genetic programming. In particular, we devise an elaborate search space encompassing the existing pruning metrics to discover the potential symbolic pruning metric. We propose an opposing operation simplification strategy to increase the diversity of the population. In this way, Pruner-Zero allows auto-generation of symbolic pruning metrics. Based on the searched results, we explore the correlation between pruning metrics and performance after pruning and summarize some principles. Extensive experiments on LLaMA and LLaMA-2 on language modeling and zero-shot tasks demonstrate that our Pruner-Zero obtains superior performance than SOTA post-training pruning methods. Code at: https://github.com/pprp/Pruner-Zero.

6/6/2024

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at post-training phase and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

6/18/2024

Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Hongjie Wang, Bhishma Dedhia, Niraj K. Jha

Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exponentially growing inference cost that scales quadratically with the number of tokens in the input sequence. Token pruning is an emerging solution to address this challenge due to its ease of deployment on various Transformer backbones. However, most token pruning methods require computationally expensive fine-tuning, which is undesirable in many edge deployment cases. In this work, we propose Zero-TPrune, the first zero-shot method that considers both the importance and similarity of tokens in performing token pruning. It leverages the attention graph of pre-trained Transformer models to produce an importance distribution for tokens via our proposed Weighted Page Rank (WPR) algorithm. This distribution further guides token partitioning for efficient similarity-based pruning. Due to the elimination of the fine-tuning overhead, Zero-TPrune can prune large models at negligible computational cost, switch between different pruning configurations at no computational cost, and perform hyperparameter tuning efficiently. We evaluate the performance of Zero-TPrune on vision tasks by applying it to various vision Transformer backbones and testing them on ImageNet. Without any fine-tuning, Zero-TPrune reduces the FLOPs cost of DeiT-S by 34.7% and improves its throughput by 45.3% with only 0.4% accuracy loss. Compared with state-of-the-art pruning methods that require fine-tuning, Zero-TPrune not only eliminates the need for fine-tuning after pruning but also does so with only 0.1% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning methods, Zero-TPrune reduces accuracy loss by up to 49% with similar FLOPs budgets. Project webpage: https://jha-lab.github.io/zerotprune.

4/9/2024

ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

Xiang Meng, Kayhan Behdin, Haoyue Wang, Rahul Mazumder

The impressive performance of Large Language Models (LLMs) across various natural language processing tasks comes at the cost of vast computational resources and storage requirements. One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression. In this paper, we introduce ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. ALPS substantially outperforms state-of-the-art methods in terms of the pruning objective and perplexity reduction, particularly for highly sparse models. On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.

6/13/2024