Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

2402.02834

Published 6/26/2024 by Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song

cs.LG cs.CL

Abstract

Structured pruning of modern large language models (LLMs) has emerged as a way of decreasing their high computational needs. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while maintaining the number of layers. Depth pruning, in contrast, removes entire layers or blocks, while keeping the size of the remaining weights unchanged. Most current research focuses on either width-only or a blend of width and depth pruning, with little comparative analysis between the two units (width vs. depth) concerning their impact on LLM inference efficiency. In this work, we show that simple depth pruning can effectively compress LLMs while achieving comparable or superior performance to recent width pruning studies. Our pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes for running LLMs, where width pruning is ineffective. In retraining pruned models for quality recovery, continued pretraining on a large corpus markedly outperforms LoRA-based tuning, particularly at severe pruning ratios. We hope this work can help build compact yet capable LLMs. Code and models can be found at: https://github.com/Nota-NetsPresso/shortened-llm

Overview

The paper explores techniques for "pruning" or compressing modern large language models (LLMs) to reduce their high computational needs.
It compares two main approaches: "width pruning" (reducing the size of projection weight matrices) and "depth pruning" (removing entire layers or blocks).
The key finding is that simple depth pruning can effectively compress LLMs while achieving comparable or superior performance to recent width pruning methods.

Plain English Explanation

Large language models (LLMs) such as LLaMA and GPT-3 are powerful, but they require a lot of computing power to run. This makes them difficult to deploy in resource-constrained environments like mobile devices. To address this, researchers have been exploring ways to "prune" or compress these models without losing too much performance.

The paper looks at two main approaches to pruning:

Width pruning: This involves reducing the size of the weight matrices used in the model, for example by removing some of the attention heads. This can shrink the model's overall size, but the number of layers remains the same.
Depth pruning: This method removes entire layers or blocks from the model, while keeping the size of the remaining weights unchanged. The model becomes shallower rather than narrower.

The key finding is that simple depth pruning can match more complex width pruning techniques, and in some cases outperform them. Depth pruning was especially helpful for improving inference speed (how quickly the model can make predictions) when memory is limited and the model has to run with small batch sizes, a setting where the paper reports width pruning to be ineffective.

When retraining the pruned models to recover performance, the researchers found that continued pretraining on a large corpus was much more effective than a technique called LoRA, particularly for models that were heavily pruned.

Overall, this work suggests that depth pruning could be a simpler and more effective way to build compact yet capable language models, especially for deployment on resource-constrained devices.

Technical Explanation

The paper explores two main approaches to structured pruning of large language models (LLMs):

Width pruning: This involves reducing the size of the projection weight matrices, such as by removing attention heads. The number of layers in the model remains unchanged.
Depth pruning: This method removes entire layers or blocks from the model, while keeping the size of the remaining weights unchanged. The network gets shallower, but each remaining block keeps its original shape.
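The difference between the two pruning units can be sketched by counting weights in a toy decoder stack. This is a minimal illustration, not the authors' method: the `DecoderConfig` class and the helper functions are hypothetical names, the default shapes are LLaMA-7B-like values chosen for illustration, and embeddings, normalization layers, and the output head are ignored.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DecoderConfig:
    """Shape of a LLaMA-style decoder stack (illustrative values, not from the paper)."""
    hidden: int = 4096    # model (residual stream) dimension
    n_layers: int = 32    # number of Transformer blocks
    n_heads: int = 32     # attention heads per block
    head_dim: int = 128   # dimension of each attention head
    ffn: int = 11008      # MLP intermediate dimension


def block_params(cfg: DecoderConfig) -> int:
    """Weights in one block: Q, K, V, O projections plus a gated MLP (gate, up, down)."""
    attn = 4 * cfg.hidden * cfg.n_heads * cfg.head_dim
    mlp = 3 * cfg.hidden * cfg.ffn
    return attn + mlp


def total_params(cfg: DecoderConfig) -> int:
    """Weights in the whole stack (embeddings and norms are ignored)."""
    return cfg.n_layers * block_params(cfg)


def width_prune(cfg: DecoderConfig, heads_to_remove: int) -> DecoderConfig:
    """Width pruning: shrink the attention projections by dropping heads; depth is unchanged."""
    if not 0 <= heads_to_remove < cfg.n_heads:
        raise ValueError("heads_to_remove must leave at least one head")
    return replace(cfg, n_heads=cfg.n_heads - heads_to_remove)


def depth_prune(cfg: DecoderConfig, blocks_to_remove: int) -> DecoderConfig:
    """Depth pruning: drop whole blocks; every remaining block keeps its shape."""
    if not 0 <= blocks_to_remove < cfg.n_layers:
        raise ValueError("blocks_to_remove must leave at least one block")
    return replace(cfg, n_layers=cfg.n_layers - blocks_to_remove)


base = DecoderConfig()
for name, cfg in [("base", base),
                  ("width-pruned", width_prune(base, 8)),
                  ("depth-pruned", depth_prune(base, 6))]:
    print(f"{name:13s} layers={cfg.n_layers:2d} heads={cfg.n_heads:2d} "
          f"params={total_params(cfg):,}")
```

In this sketch, width pruning leaves 32 sequential blocks in place and only makes each one smaller, while depth pruning leaves fewer blocks of the original size.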

Most prior research has focused on width-only pruning or a blend of width and depth pruning, with little comparative analysis between the two.

In this work, the authors show that simple depth pruning can effectively compress LLMs while achieving comparable or superior performance to recent width pruning studies. Their depth pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes, where width pruning is ineffective.
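One way to build intuition for the small-batch result is a toy latency model. The sketch below is an assumption for illustration only: the function name and both cost constants are hypothetical, not measurements from the paper. It simply assumes that at small batch sizes each block has a cost component that does not shrink when its matrices get narrower.

```python
def decode_latency_us(n_layers, width_keep_pct, fixed_us=200, width_us=50):
    """Toy per-token latency (microseconds) for a stack of identical blocks.

    Each block pays `fixed_us` regardless of its width, plus `width_us`
    scaled by the percentage of width that was kept. Both constants are
    made-up illustrative values, not measurements.
    """
    if not 0 < width_keep_pct <= 100:
        raise ValueError("width_keep_pct must be in (0, 100]")
    return n_layers * (fixed_us + width_us * width_keep_pct // 100)


baseline = decode_latency_us(n_layers=32, width_keep_pct=100)
width_pruned = decode_latency_us(n_layers=32, width_keep_pct=80)   # narrower blocks
depth_pruned = decode_latency_us(n_layers=26, width_keep_pct=100)  # fewer blocks

print("baseline    ", baseline)
print("width-pruned", width_pruned)
print("depth-pruned", depth_pruned)
```

Under these assumed constants, removing blocks eliminates the fixed per-block cost entirely for each removed block, whereas narrowing blocks only trims the width-dependent part. Real speedups depend on hardware, kernels, and batch size.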

When retraining the pruned models to recover performance, the authors found that continued pretraining on a large corpus markedly outperforms LoRA-based tuning, particularly at severe pruning ratios.
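To give a sense of how different the two retraining regimes are in scale, the sketch below compares trainable weights for a single projection matrix. It assumes continued pretraining updates every weight, and it uses the standard LoRA parameterization in which a frozen weight of shape d_out x d_in is adapted by two low-rank factors. The rank of 8 and the 4096 x 4096 shape are illustrative assumptions, not settings taken from the paper.

```python
def full_trainable(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole matrix is updated (assumed for continued pretraining)."""
    return d_in * d_out


def lora_trainable(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights under LoRA: factor A is (rank x d_in), factor B is (d_out x rank)."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return rank * (d_in + d_out)


d_in = d_out = 4096   # illustrative projection shape
rank = 8              # illustrative LoRA rank

full = full_trainable(d_in, d_out)
lora = lora_trainable(d_in, d_out, rank)
print(f"full update: {full:,} trainable weights")
print(f"LoRA (r={rank}): {lora:,} trainable weights ({100 * lora / full:.2f}% of full)")
```

This arithmetic only shows the gap in trainable capacity per matrix; it does not by itself account for the quality difference reported in the paper, which also involves the size of the retraining corpus.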

Critical Analysis

The paper provides a thorough evaluation of depth pruning as an alternative to more commonly studied width pruning techniques. The authors acknowledge that their depth pruning method is a relatively simple approach, but argue that it can be just as effective as more complex methods.

One potential limitation of the research is that it primarily focuses on comparing depth and width pruning, without exploring more advanced pruning strategies that combine the two techniques, as mentioned in related work. It would be interesting to see how a hybrid approach might perform.

Additionally, the paper does not delve into the theoretical reasons why depth pruning may be more effective than width pruning in certain scenarios, such as memory-constrained inference. Further analysis of the underlying mechanisms could provide valuable insights.

Overall, the research makes a compelling case for depth pruning as a simple yet powerful technique for compressing large language models. The findings could have important implications for deploying these models in resource-constrained environments, and the authors' code and models provide a useful starting point for further exploration.

Conclusion

This paper demonstrates that simple depth pruning can be an effective and efficient way to compress large language models, achieving comparable or superior performance to recent width pruning methods. The depth pruning method boosts inference speeds, especially in memory-constrained conditions that force small batch sizes, and continued pretraining on a large corpus is shown to be more effective than LoRA-based tuning for recovering model quality after pruning, particularly at severe pruning ratios.

These findings suggest that depth pruning could be a promising approach for building compact yet capable language models, particularly for deployment on resource-constrained devices. The work contributes to the ongoing efforts to make large language models more accessible and practical for a wider range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs.

4/12/2024

cs.CL cs.AI cs.LG

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at post-training phase and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

6/18/2024

cs.LG cs.CL stat.ML

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods involving intensive weight updates. Code is available at https://github.com/locuslab/wanda.

5/7/2024

cs.CL cs.AI cs.LG

BlockPruner: Fine-grained Pruning for Large Language Models

Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li

With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks. It then assesses the importance of these blocks using perplexity measures and applies a heuristic search for iterative pruning. We applied BlockPruner to LLMs of various sizes and architectures and validated its performance across a wide range of downstream tasks. Experimental results show that BlockPruner achieves more granular and effective pruning compared to state-of-the-art baselines.

6/21/2024

cs.CL