Structural Pruning of Pre-trained Language Models via Neural Architecture Search

arXiv:2405.02267

Published 5/6/2024 by Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau

Abstract

Pre-trained language models (PLMs), for example BERT or RoBERTa, mark the state-of-the-art for natural language understanding tasks when fine-tuned on labeled data. However, their large size poses challenges in deploying them for inference in real-world applications, due to significant GPU memory requirements and high inference latency. This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade off efficiency, for example in terms of model size or latency, and generalization performance. We also show how we can utilize more recently developed two-stage weight-sharing NAS approaches in this setting to accelerate the search process. Unlike traditional pruning methods with fixed thresholds, we propose to adopt a multi-objective approach that identifies the Pareto optimal set of sub-networks, allowing for a more flexible and automated compression process.

Overview

This paper explores using neural architecture search (NAS) to find efficient sub-networks within large pre-trained language models (PLMs) like BERT and RoBERTa.
PLMs are state-of-the-art for natural language understanding tasks, but their large size makes them challenging to deploy in real-world applications due to high GPU memory requirements and inference latency.
The researchers propose a multi-objective NAS approach that identifies the Pareto optimal set of sub-networks, balancing efficiency (e.g., model size, latency) against generalization performance, in contrast to traditional pruning methods that rely on fixed thresholds.
They also show how two-stage weight-sharing NAS can be used to accelerate the search process.

Plain English Explanation

Large pre-trained language models (PLMs) like BERT and RoBERTa are very good at understanding and processing natural language. However, their large size makes them challenging to use in real-world applications, as they require a lot of computer memory and take a long time to make predictions.

This paper explores a technique called neural architecture search (NAS) to find smaller, more efficient versions of these large PLMs. The idea is to identify sub-parts of the original model that can still perform well on language tasks, but use less memory and run faster. Unlike traditional pruning methods that use fixed thresholds, the researchers propose a multi-objective approach that looks for the best trade-off between efficiency (e.g., smaller model size, faster inference) and performance.

The researchers also show how to speed up the search with a two-stage weight-sharing NAS approach. Rather than training every candidate sub-network from scratch, the candidates share the weights of one larger network, so many more of them can be evaluated in the same amount of time.
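To make the weight-sharing idea concrete, here is a minimal sketch in plain Python. It assumes a toy feed-forward block with made-up sizes and random weights (none of this comes from the paper). It shows that a sub-network built by slicing out the kept units' weights gives the same output as the full block with the pruned units masked to zero, which is why a candidate can be evaluated without training its own weights.

```python
import math
import random

random.seed(0)

# Toy "super-network" feed-forward block (sizes are arbitrary, for illustration).
D_MODEL, D_FF = 4, 8
W1 = [[random.gauss(0, 1) for _ in range(D_FF)] for _ in range(D_MODEL)]
B1 = [random.gauss(0, 1) for _ in range(D_FF)]
W2 = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_FF)]
B2 = [random.gauss(0, 1) for _ in range(D_MODEL)]

def hidden_unit(x, j):
    """ReLU activation of intermediate unit j for input vector x."""
    return max(0.0, sum(x[i] * W1[i][j] for i in range(D_MODEL)) + B1[j])

def ffn_masked(x, mask):
    """Full-size block with pruned units multiplied by zero."""
    h = [hidden_unit(x, j) * mask[j] for j in range(D_FF)]
    return [sum(h[j] * W2[j][o] for j in range(D_FF)) + B2[o] for o in range(D_MODEL)]

def ffn_sliced(x, kept):
    """Physically smaller block that reuses the super-network weights of the kept units."""
    h = {j: hidden_unit(x, j) for j in kept}
    return [sum(h[j] * W2[j][o] for j in kept) + B2[o] for o in range(D_MODEL)]

kept = [0, 1, 2]  # the sub-network keeps 3 of the 8 intermediate units
mask = [1.0 if j in kept else 0.0 for j in range(D_FF)]
x = [random.gauss(0, 1) for _ in range(D_MODEL)]

out_masked = ffn_masked(x, mask)
out_sliced = ffn_sliced(x, kept)
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(out_masked, out_sliced))
```

Masking is convenient while searching, because one set of weights serves every candidate. Slicing is what would actually reduce memory and latency once a sub-network has been chosen.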

Technical Explanation

The paper focuses on using neural architecture search (NAS) to find efficient sub-networks within large pre-trained language models (PLMs) like BERT and RoBERTa. PLMs have set new state-of-the-art benchmarks for natural language understanding tasks when fine-tuned on labeled data. However, their large size poses challenges for deploying them in real-world applications due to significant GPU memory requirements and high inference latency.

The researchers propose a multi-objective NAS approach to identify Pareto optimal sub-networks that balance efficiency (e.g., model size, latency) and generalization performance. Unlike traditional pruning methods that use fixed thresholds, this allows for a more flexible and automated compression process. They also demonstrate how to leverage recently developed two-stage weight-sharing NAS techniques to accelerate the search, rather than starting from scratch each time.
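The notion of a Pareto optimal set can be illustrated with a short sketch. It assumes two objectives that are both minimized (parameter count and validation error). The candidate names and numbers are made up for illustration and are not results from the paper. A candidate is kept only if no other candidate is at least as good on both objectives and strictly better on one.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    params_m: float  # model size in millions of parameters (lower is better)
    error: float     # validation error (lower is better)

def dominates(a, b):
    """a dominates b if a is no worse on both objectives and strictly better on one."""
    no_worse = a.params_m <= b.params_m and a.error <= b.error
    strictly_better = a.params_m < b.params_m or a.error < b.error
    return no_worse and strictly_better

def pareto_front(candidates):
    """Return the non-dominated candidates, sorted by model size."""
    front = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
    return sorted(front, key=lambda c: c.params_m)

# Hypothetical sub-networks; the values are invented for this example.
candidates = [
    Candidate("full", 110.0, 0.080),
    Candidate("a", 85.0, 0.082),
    Candidate("b", 85.0, 0.095),  # dominated by "a"
    Candidate("c", 60.0, 0.090),
    Candidate("d", 66.0, 0.110),  # dominated by "c"
    Candidate("e", 40.0, 0.130),
]

front = pareto_front(candidates)
front_names = [c.name for c in front]
```

Here `front_names` is `["e", "c", "a", "full"]`. Instead of a single model produced by one fixed threshold, the practitioner gets a menu of sub-networks and can pick the one that fits a given memory or latency budget.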

The key elements of the paper include:

Formulating the sub-network identification as a multi-objective optimization problem to find the Pareto optimal set
Adapting two-stage weight-sharing NAS approaches to this setting to speed up the search
Comprehensive experiments evaluating the effectiveness of the proposed method on various PLM architectures and downstream tasks
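One of the objectives in such a search is model size. The sketch below shows how that objective could be computed for a candidate sub-network. It assumes a BERT-base-like encoder (hidden size 768, 12 layers, 12 heads of dimension 64, feed-forward width 3072) and a hypothetical search space over the number of layers, heads, and feed-forward units, applied uniformly to every layer. This summary does not specify the paper's actual search space, so treat the names and ranges as illustrative.

```python
import random

def encoder_layer_params(d_model, n_heads, head_dim, d_ff):
    """Weights and biases of one Transformer encoder layer."""
    attn_width = n_heads * head_dim
    qkv = 3 * (d_model * attn_width + attn_width)        # query, key, value projections
    attn_out = attn_width * d_model + d_model            # attention output projection
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * 2 * d_model                        # two LayerNorms, scale and shift
    return qkv + attn_out + ffn + layer_norms

def encoder_params(n_layers, n_heads, d_ff, d_model=768, head_dim=64):
    """Parameter count of the encoder stack (embeddings excluded)."""
    return n_layers * encoder_layer_params(d_model, n_heads, head_dim, d_ff)

# Hypothetical search space, for illustration only.
SEARCH_SPACE = {
    "n_layers": range(1, 13),
    "n_heads": range(1, 13),
    "d_ff": range(64, 3073, 64),
}

def sample_config(rng):
    """Draw one sub-network configuration uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

full = encoder_params(n_layers=12, n_heads=12, d_ff=3072)
small = encoder_params(n_layers=6, n_heads=6, d_ff=1536)

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]
sizes = [encoder_params(**cfg) for cfg in configs]
```

In a full search, each sampled configuration would also be scored on validation data using the shared weights, and the resulting (size, error) pairs would be filtered down to the Pareto optimal set.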

Critical Analysis

The paper presents a novel and promising approach to addressing the efficiency challenges of large pre-trained language models. By using multi-objective NAS to identify Pareto optimal sub-networks, the researchers show how it's possible to find models that are both performant and lightweight, without having to manually tune pruning thresholds.

However, the paper does not delve into potential limitations or areas for further research. For example, it's unclear how the identified sub-networks would perform on a wider range of tasks beyond the specific ones evaluated, or how well the approach would scale to even larger PLMs. Additionally, the computational cost and search time of the two-stage NAS technique could be an important practical consideration for real-world deployment.

It would also be valuable to see the researchers explore the interpretability of the discovered sub-networks, to better understand what architectural elements are most important for maintaining performance. This could yield insights that complement the automated search process.

Overall, the paper makes a compelling case for the use of multi-objective NAS to balance efficiency and effectiveness in pre-trained language models. Further research addressing the potential limitations and exploring additional applications would help strengthen the impact of this work.

Conclusion

This paper presents a novel approach to improving the efficiency of large pre-trained language models (PLMs) like BERT and RoBERTa. By using neural architecture search (NAS) to identify Pareto optimal sub-networks, the researchers demonstrate how it's possible to find models that are both performant and lightweight, without having to manually tune pruning thresholds.

The key contributions of this work include:

A multi-objective NAS formulation to balance efficiency (e.g., model size, latency) and generalization performance
Adapting two-stage weight-sharing NAS techniques to accelerate the search process
Comprehensive experiments validating the effectiveness of the proposed method

This research has important implications for deploying state-of-the-art language models in real-world applications, where memory and latency constraints are critical. Further work exploring the scalability, interpretability, and broader application of these techniques could yield valuable insights for the field of natural language processing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models

Anthony Sarah, Sharath Nittur Sridhar, Maciej Szankin, Sairam Sundaresan

The abilities of modern large language models (LLMs) in solving natural language processing, complex reasoning, sentiment analysis and other tasks have been extraordinary which has prompted their extensive adoption. Unfortunately, these abilities come with very high memory and computational costs which precludes the use of LLMs on most hardware platforms. To mitigate this, we propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot NAS. In particular, we fine-tune LLaMA2-7B only once and then apply genetic algorithm-based search to find smaller, less computationally complex network architectures. We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex. More specifically, we demonstrate a 1.5x reduction in model size and 1.3x speedup in throughput for certain tasks with negligible drop in accuracy. In addition to finding smaller, higher-performing network architectures, our method does so more effectively and efficiently than certain pruning or sparsification techniques. Finally, we demonstrate how quantization is complementary to our method and that the size and complexity of the networks we find can be further decreased using quantization. We believe that our work provides a way to automatically create LLMs which can be used on less expensive and more readily available hardware platforms.

5/29/2024

cs.AI

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at the post-training phase and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

6/18/2024

cs.LG cs.CL stat.ML

Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang

Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

5/16/2024

cs.CL cs.AI cs.LG

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs.

4/12/2024

cs.CL cs.AI cs.LG