$\mathrm{SP}^3$: Enhancing Structured Pruning via PCA Projection

2308.16475

Published 4/23/2024 by Yuxuan Hu, Jing Zhang, Zhe Zhao, Chen Zhao, Xiaodong Chen, Cuiping Li, Hong Chen

Abstract

Structured pruning is a widely used technique for reducing the size of pre-trained language models (PLMs), but current methods often overlook the potential of compressing the hidden dimension (d) in PLMs, a dimension critical to model size and efficiency. This paper introduces a novel structured pruning approach, Structured Pruning with PCA Projection (SP3), targeting the effective reduction of d by projecting features into a space defined by principal components before masking. Extensive experiments on benchmarks (GLUE and SQuAD) show that SP3 can reduce d by 70%, compress 94% of the BERTbase model, maintain over 96% accuracy, and, at the same compression ratio, achieve 6% higher accuracy than other methods that compress d. SP3 has also proven effective with other models, including OPT and Llama. Our data and code are available at an anonymous repo.

Overview

Structured pruning is a widely used technique for reducing the size of pre-trained language models (PLMs).
Current methods often overlook the potential of compressing the hidden dimension (d) in PLMs, which is critical to model size and efficiency.
This paper introduces a novel structured pruning approach called Structured Pruning with PCA Projection (SP3), which targets the effective reduction of d by projecting features into a space defined by principal components before masking.

Plain English Explanation

The paper discusses a new way to make language models smaller and more efficient. Language models are AI systems that can understand and generate human-like text, but they can be very large and complex. Structured pruning is a common technique used to reduce the size of these models, but the authors found that current methods often overlook an important part of the model - the hidden dimension (d).

The hidden dimension is a crucial aspect that determines the model's size and efficiency. The authors' new approach, called Structured Pruning with PCA Projection (SP3), focuses on effectively reducing this hidden dimension. The key idea is to project the model's features into a space defined by principal components before applying the pruning technique. This allows the model to be compressed while still maintaining most of its accuracy.

The authors tested SP3 on popular language model benchmarks and found that it can reduce the hidden dimension by 70%, compress the BERTbase model by 94%, and maintain over 96% accuracy - outperforming other methods that compress the hidden dimension. The authors also showed that SP3 works well with other language models, such as OPT and Llama.

Technical Explanation

The paper introduces a novel structured pruning approach called Structured Pruning with PCA Projection (SP3), which targets the effective reduction of the hidden dimension (d) in pre-trained language models (PLMs). Current pruning methods often overlook the potential of compressing d, which is critical to model size and efficiency.
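To see why d matters so much for model size, the following back-of-the-envelope sketch counts the weight-matrix entries in one Transformer encoder layer. It assumes the standard BERT-base shapes (d = 768, attention width 768, feed-forward width 3072) and ignores biases and LayerNorm; the function name `layer_weight_count` is hypothetical, and the calculation shrinks only d, which is a simplification and not the paper's full pruning recipe.

```python
def layer_weight_count(d, d_attn, d_ff):
    """Count weight-matrix entries in one encoder layer (biases, LayerNorm ignored)."""
    attention = 4 * d * d_attn    # W_q, W_k, W_v are d x d_attn; W_o is d_attn x d
    feed_forward = 2 * d * d_ff   # up-projection d x d_ff, down-projection d_ff x d
    return attention + feed_forward


# Assumed BERT-base shapes.
full = layer_weight_count(d=768, d_attn=768, d_ff=3072)

# Shrink only the hidden dimension by 70%, leaving the other widths untouched.
reduced_d = round(768 * (1 - 0.70))
pruned = layer_weight_count(d=reduced_d, d_attn=768, d_ff=3072)

ratio = pruned / full
print(f"full layer:   {full:,} weights")
print(f"d = {reduced_d}:     {pruned:,} weights")
print(f"remaining:    {ratio:.1%}")
```

Every weight matrix in the layer has d as one of its two dimensions, so the count scales linearly with d: cutting d by 70% removes roughly 70% of the layer's weights before any heads or feed-forward units are pruned.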

SP3 works by projecting the model's features into a space defined by principal components before masking dimensions. Because the projection concentrates most of the feature variance in a few leading components, the remaining components can be masked with little loss of accuracy. The authors conducted extensive experiments on language model benchmarks (GLUE and SQuAD) and found that SP3 can reduce d by 70%, compress the BERTbase model by 94%, and maintain over 96% accuracy. Moreover, at the same compression ratio, SP3 achieved 6% higher accuracy than other methods that compress d.
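The idea of projecting before masking can be illustrated with a small NumPy sketch. This is a toy example under stated assumptions, not the authors' implementation: the data are synthetic, the names `H`, `W`, and `Q` are hypothetical, and a fixed "keep the top-k components" mask stands in for whatever masks the actual method learns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic hidden states: 1000 tokens, d = 16, variance concentrated in k = 4 directions.
n_tokens, d, k = 1000, 16, 4
latent = rng.normal(size=(n_tokens, k))
mixing = rng.normal(size=(k, d))
H = latent @ mixing + 0.01 * rng.normal(size=(n_tokens, d))

# A toy linear layer that consumes the hidden states: y = H @ W.
W = rng.normal(size=(d, 8))
y_full = H @ W

# PCA of the features: eigenvectors of the covariance, sorted by decreasing variance.
cov = np.cov(H, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
Q = eigvecs[:, order]                    # d x d orthogonal matrix

# Rotating features and weights together is lossless: (H Q)(Q^T W) = H W.
H_rot = H @ Q
W_rot = Q.T @ W
assert np.allclose(H_rot @ W_rot, y_full)

# Mask that keeps the first k of d dimensions.
mask = np.zeros(d)
mask[:k] = 1.0

# Masking in the PCA basis drops only low-variance directions.
y_pca = (H_rot * mask) @ W_rot
# Masking in the original basis drops arbitrary coordinates.
y_raw = (H * mask) @ W

err_pca = np.linalg.norm(y_full - y_pca) / np.linalg.norm(y_full)
err_raw = np.linalg.norm(y_full - y_raw) / np.linalg.norm(y_full)
print(f"relative error, mask after PCA projection: {err_pca:.4f}")
print(f"relative error, mask in original basis:    {err_raw:.4f}")
```

Both variants keep the same number of dimensions, but the rotation `Q` can be folded into the adjacent weight matrices, so the masked dimensions in the PCA basis carry far less of the signal than the same number of raw coordinates.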

The authors also demonstrated the effectiveness of SP3 with other language models, including OPT and Llama. The data and code for this research are available in an anonymous repository.

Critical Analysis

The paper presents a promising approach to compressing pre-trained language models, but it is important to consider some potential limitations and areas for further research.

One potential concern is the generalizability of the SP3 method. The authors have demonstrated its effectiveness on several popular language models, but it would be valuable to see how it performs on a wider range of architectures and tasks. Additionally, the authors mention that the method may be less suitable for extremely sparse models, so further exploration of its limitations in this regard would be beneficial.

Another area for consideration is the potential impact of the reduced hidden dimension on the model's capabilities. While the authors show that SP3 can maintain high accuracy, it would be interesting to investigate whether there are any subtle changes in the model's behavior or performance on more nuanced or specialized tasks.

Finally, the paper could have explored the computational efficiency of the SP3 method in more detail, as the trade-offs between model size, accuracy, and inference speed are of critical importance for real-world applications. Neuroprune and other structured pruning techniques may provide useful points of comparison in this regard.

Conclusion

The Structured Pruning with PCA Projection (SP3) approach introduced in this paper represents a significant advancement in the field of language model compression. By effectively reducing the hidden dimension (d) - a key factor in model size and efficiency - SP3 is able to achieve impressive compression rates while maintaining high accuracy. The authors' extensive experiments and the demonstrated effectiveness across multiple language models suggest that SP3 could have a meaningful impact on the development of more compact and efficient pre-trained language models, with potential benefits for a wide range of real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs.

4/12/2024

cs.CL cs.AI cs.LG

Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang

Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

5/16/2024

cs.CL cs.AI cs.LG

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Liang Zhao

The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet, traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes where it surpasses current state-of-the-art methods.

5/27/2024

cs.CL

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning; however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at the post-training phase and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

6/18/2024

cs.LG cs.CL stat.ML