NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

arXiv:2402.09773

Published 6/28/2024 by Shengrui Li, Junzhe Chen, Xueting Han, Jing Bai

Abstract

The considerable size of Large Language Models (LLMs) presents notable deployment challenges, particularly on resource-constrained hardware. Structured pruning offers an effective means to compress LLMs, thereby reducing storage costs and enhancing inference speed for more efficient utilization. In this work, we study data-efficient and resource-efficient structured pruning methods to obtain smaller yet still powerful models. Knowledge Distillation is well-suited for pruning, as the intact model can serve as an excellent teacher for pruned students. However, it becomes challenging in the context of LLMs due to memory constraints. To address this, we propose an efficient progressive Numerous-teacher pruning method (NutePrune). NutePrune mitigates excessive memory costs by loading only one intact model and integrating it with various masks and LoRA modules, enabling it to seamlessly switch between teacher and student roles. This approach allows us to leverage numerous teachers with varying capacities to progressively guide the pruned model, enhancing overall performance. Extensive experiments across various tasks demonstrate the effectiveness of NutePrune. In LLaMA-7B zero-shot experiments, NutePrune retains 97.17% of the performance of the original model at 20% sparsity and 95.07% at 25% sparsity. Our code is available at https://github.com/Lucius-lsr/NutePrune.
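As a rough illustration of the role switching described in the abstract, the sketch below stores one frozen weight matrix and lets each role attach its own mask and optional low-rank (LoRA-style) update. The `Role` class, the `forward` function, and all numbers are hypothetical and chosen for illustration only; they are not taken from the NutePrune code base.

```python
from dataclasses import dataclass
from typing import List, Optional

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


@dataclass
class Role:
    """Hypothetical per-role add-ons attached to the shared weights."""
    mask: List[float]                 # 1.0 keeps an output unit, 0.0 prunes it
    lora_a: Optional[Matrix] = None   # shape: rank x in_features
    lora_b: Optional[Matrix] = None   # shape: out_features x rank


def forward(weight: Matrix, x: Vector, role: Role) -> Vector:
    """Run one linear layer with the shared weights under a given role."""
    out = matvec(weight, x)
    if role.lora_a is not None and role.lora_b is not None:
        # Low-rank update: B (A x), added to the frozen layer's output.
        delta = matvec(role.lora_b, matvec(role.lora_a, x))
        out = [o + d for o, d in zip(out, delta)]
    # Masked output units contribute nothing, which emulates structured pruning.
    return [m * o for m, o in zip(role.mask, out)]


# The frozen weights are stored once and shared by every role.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = Role(mask=[1.0, 1.0, 1.0])  # intact model: nothing pruned, no LoRA
student = Role(
    mask=[1.0, 0.0, 1.0],             # second output unit is pruned
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [0.5], [0.5]],
)

x = [2.0, 3.0]
teacher_out = forward(weight, x, teacher)
student_out = forward(weight, x, student)
```

Only the small `Role` objects differ between teacher and student here, so adding another teacher of intermediate capacity would cost one more mask and LoRA pair rather than another copy of the weights.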

Overview

This paper introduces NutePrune, a new efficient and progressive pruning method for large language models (LLMs) that leverages numerous teacher models.
NutePrune aims to achieve high model compression and inference speedup while maintaining model performance.
The key ideas are to prune the student model progressively under the guidance of numerous teachers of varying capacity, and to keep memory costs low by loading a single intact model that switches between teacher and student roles through different masks and LoRA modules.

Plain English Explanation

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models is a new technique for reducing the size and improving the efficiency of large language models.

Large language models are very powerful, but they can also be very large and computationally intensive to run. Pruning is a technique used to remove unnecessary parts of the model to make it smaller and faster, while trying to maintain its performance.
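To make the idea of removing whole parts of a model concrete, here is a minimal sketch of structured pruning on a single weight matrix. It drops the rows (output units) with the smallest L2 norm. This magnitude criterion is a generic stand-in chosen for simplicity, not the criterion NutePrune uses, and the function name `prune_rows` is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def prune_rows(weight: Matrix, sparsity: float) -> Matrix:
    """Remove the fraction `sparsity` of rows with the smallest L2 norm."""
    n_prune = int(len(weight) * sparsity)
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight]
    # Indices sorted from least to most important under the norm criterion.
    order = sorted(range(len(weight)), key=lambda i: norms[i])
    pruned = set(order[:n_prune])
    # Keep the surviving rows in their original order.
    return [row for i, row in enumerate(weight) if i not in pruned]


weight = [[3.0, 4.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.2]]
smaller = prune_rows(weight, sparsity=0.5)
```

Because entire rows disappear, the resulting matrix is genuinely smaller, which is why structured pruning can reduce both storage and inference time.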

The key innovation in NutePrune is that it uses numerous "teachers" of varying capacity to guide the pruning process, rather than just a single teacher. Because these teachers guide the pruned "student" model progressively, NutePrune can make the student smaller and faster without losing too much accuracy.
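A teacher typically guides a student through a distillation loss that penalizes the student when its output distribution drifts from the teacher's. The sketch below uses a plain KL divergence over softmax outputs, which is a common choice and an assumption here; the exact loss terms in NutePrune may differ. The function names and logits are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) between the two softmax distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


same = distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
different = distillation_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

The loss is zero when the student matches the teacher and grows as the two distributions diverge, so minimizing it pulls the pruned model back toward the teacher's behavior.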

NutePrune also keeps memory use in check. Instead of holding separate teacher and student models in memory, it loads one intact model and attaches different masks and LoRA modules to it, so the same model can switch between teacher and student roles.

Overall, NutePrune provides an efficient way to optimize large language models for real-world deployment, making them more practical to use in resource-constrained environments like mobile devices or embedded systems.

Technical Explanation

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models presents a new pruning method for compressing large language models (LLMs) while maintaining their performance.

The key components of NutePrune are:

Progressive Pruning: The model is pruned in an iterative, stage-wise fashion, with the proportion of pruned weights (the sparsity) increased gradually at each stage. This lets the model adapt to each change before the next one.
Numerous Teachers: Rather than relying on a single teacher to guide the pruning, NutePrune leverages numerous teachers with varying capacities. This provides richer guidance and helps the student model maintain performance.
Single-Model Role Switching: Only one intact model is loaded. Different masks and LoRA modules are integrated with it so that it can switch between teacher and student roles, which avoids the memory cost of keeping separate teacher models.
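The progressive schedule and the growing pool of teachers can be sketched as follows. The linear sparsity ramp and the rule that every earlier, less-pruned snapshot becomes a teacher are assumptions made for illustration; the paper's actual schedule and teacher selection may differ. The 20% target matches one of the sparsity levels reported in the abstract.

```python
from typing import List


def sparsity_schedule(target: float, n_stages: int) -> List[float]:
    """Linearly ramp sparsity from target / n_stages up to target."""
    return [round(target * (s + 1) / n_stages, 4) for s in range(n_stages)]


def teachers_for_stage(schedule: List[float], stage: int) -> List[float]:
    """Sparsity levels of the teachers available at a given stage.

    The intact model (sparsity 0.0) is always a teacher; each earlier,
    less-pruned snapshot is added as a teacher of lower capacity.
    """
    return [0.0] + schedule[:stage]


schedule = sparsity_schedule(target=0.20, n_stages=4)
plan = [
    (stage, sparsity, teachers_for_stage(schedule, stage))
    for stage, sparsity in enumerate(schedule)
]
```

Each entry of `plan` pairs a stage's student sparsity with the teachers that could guide it, so later stages see more teachers whose capacity lies between the intact model and the current student.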

The paper evaluates NutePrune across various tasks. In zero-shot experiments on LLaMA-7B, the pruned model retains 97.17% of the original model's performance at 20% sparsity and 95.07% at 25% sparsity.

The authors also analyze how the number of teachers affects the pruning results.

Critical Analysis

The NutePrune paper presents a compelling approach for efficiently pruning large language models. Using numerous teachers, all derived from a single intact model that switches roles through masks and LoRA modules, appears effective in balancing model size, speed, and performance against memory cost.

However, the paper does not extensively explore the limitations of the method. For example, it would be valuable to understand how the performance of NutePrune scales as the size of the original language model increases, or how it compares to other recent pruning techniques like Optimization-Based Structural Pruning or Sheared LLaMA.

Additionally, the paper does not address potential issues around the increased complexity of managing multiple teacher models, or how the method might perform on more specialized language tasks beyond the standard benchmarks.

Overall, NutePrune appears to be a promising technique, but further research and evaluation would be helpful to fully understand its strengths, weaknesses, and applicability to real-world large language model deployments.

Conclusion

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models introduces a pruning method that uses numerous teachers of varying capacity, all obtained from a single intact model through masks and LoRA modules, to compress large language models and speed up inference while retaining most of their performance.

The key innovations of NutePrune, such as its progressive pruning approach and use of numerous teacher models, demonstrate the potential for improving the efficiency and practicality of deploying large, powerful language models in resource-constrained environments. As the size and complexity of language models continue to grow, techniques like NutePrune will become increasingly important for making these models more accessible and practical for real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang

Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

5/16/2024

cs.CL cs.AI cs.LG

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at post-training phase and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

6/18/2024

cs.LG cs.CL stat.ML

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs.

4/12/2024

cs.CL cs.AI cs.LG

Large Language Model Pruning

Hanjuan Huang, Hao-Jia Song, Hsing-Kuo Pao (Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan)

We surely enjoy the larger the better models for their superior performance in the last couple of years when both the hardware and software support the birth of such extremely huge models. The applied fields include text mining and others. In particular, the success of LLMs on text understanding and text generation draws attention from researchers who have worked on NLP and related areas for years or even decades. On the side, LLMs may suffer from problems like model overfitting, hallucination, and device limitation to name a few. In this work, we suggest a model pruning technique specifically focused on LLMs. The proposed methodology emphasizes the explainability of deep learning models. By having the theoretical foundation, we obtain a trustworthy deep model so that huge models with a massive number of model parameters become not quite necessary. A mutual information-based estimation is adopted to find neurons with redundancy to eliminate. Moreover, an estimator with well-tuned parameters helps to find precise estimation to guide the pruning procedure. At the same time, we also explore the difference between pruning on large-scale models vs. pruning on small-scale models. The choice of pruning criteria is sensitive in small models but not for large-scale models. It is a novel finding through this work. Overall, we demonstrate the superiority of the proposed model to the state-of-the-art models.

6/4/2024

cs.CL cs.AI cs.LG