Dependency-Aware Semi-Structured Sparsity of GLU Variants in Large Language Models

Read original: arXiv:2405.01943 - Published 6/21/2024 by Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe

Overview

Rapid advancements in Large Language Models (LLMs) have improved language understanding and generation, but the large model size creates hardware challenges.
To address these challenges, the paper proposes a novel pruning method called Dependency-aware Semi-structured Sparsity (DaSS) for SwiGLU-based LLMs.
DaSS incorporates structural dependency into magnitude-based unstructured pruning, balancing the flexibility of unstructured pruning with the structural consistency of dependency-based structured pruning.
Evaluations show DaSS outperforms SparseGPT and Wanda under hardware-friendly N:M sparsity patterns while retaining the computational efficiency of Wanda.

Plain English Explanation

Large language models (LLMs) have become increasingly powerful at understanding and generating human-like text. However, the massive size of these models poses challenges for the hardware required to run them, such as needing a lot of memory and taking a long time to generate new text.

To address these issues, the researchers developed a new pruning technique called Dependency-aware Semi-structured Sparsity (DaSS). Pruning is the process of removing parts of a model to make it smaller and more efficient, while still preserving its core capabilities.

DaSS works by identifying the most important weights (numerical values) in the model and removing the less important ones. But it does this in a distinctive way: it considers not just the magnitude (size) of each weight, but also how strongly the part of the model that weight feeds into is activated. This helps maintain the overall structure and consistency of the model, rather than judging every weight in isolation.
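To make the idea of a weight being "connected to other parts of the model" concrete, here is a minimal NumPy sketch of a SwiGLU-style MLP. It is an illustration only: the shapes, the random weights, and the names `w_gate`, `w_up`, and `w_down` are assumptions rather than anything taken from the paper. It shows that row k of the gate projection, row k of the up projection, and column k of the down projection all serve the same intermediate neuron, so zeroing any one of them silences that neuron in the same way.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation; note silu(0) == 0.
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Shapes: x (tokens, d_model); w_gate, w_up (d_int, d_model); w_down (d_model, d_int).
    hidden = silu(x @ w_gate.T) * (x @ w_up.T)  # (tokens, d_int)
    return hidden @ w_down.T                    # (tokens, d_model)

# Toy sizes and random weights (hypothetical, for illustration only).
rng = np.random.default_rng(0)
d_model, d_int, tokens = 8, 16, 4
x = rng.normal(size=(tokens, d_model))
w_gate = rng.normal(size=(d_int, d_model))
w_up = rng.normal(size=(d_int, d_model))
w_down = rng.normal(size=(d_model, d_int))

k = 3  # index of one intermediate neuron

# Three different ways of silencing intermediate neuron k.
w_gate_k = w_gate.copy()
w_gate_k[k, :] = 0.0   # gate output becomes silu(0) = 0
w_up_k = w_up.copy()
w_up_k[k, :] = 0.0     # up output becomes 0
w_down_k = w_down.copy()
w_down_k[:, k] = 0.0   # neuron k no longer reaches the output

out_full = swiglu_mlp(x, w_gate, w_up, w_down)
out_zero_gate = swiglu_mlp(x, w_gate_k, w_up, w_down)
out_zero_up = swiglu_mlp(x, w_gate, w_up_k, w_down)
out_zero_down = swiglu_mlp(x, w_gate, w_up, w_down_k)

# All three edits give the same output: the three weight slices form one dependency group.
print(np.abs(out_zero_gate - out_zero_down).max())
print(np.abs(out_zero_up - out_zero_down).max())
```

Because these three slices stand or fall together, a measure of how active neuron k is says something about the importance of every weight in its group, which is the intuition a dependency-aware score builds on.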

By using this more sophisticated pruning approach, the researchers were able to create smaller, more efficient versions of large language models that still performed well. Their tests showed DaSS outperformed other pruning methods at the same level of compression, without making the pruning procedure itself more expensive to run.

The end result is that DaSS could help make large language models more practical to deploy on real-world hardware, without sacrificing too much of their impressive capabilities.

Technical Explanation

The paper introduces a novel pruning method called Dependency-aware Semi-structured Sparsity (DaSS) to address the hardware challenges posed by the substantial model size of recent SwiGLU-based Large Language Models.

Unlike traditional unstructured pruning that only considers weight magnitudes, DaSS incorporates structural dependency information by evaluating the importance of each weight based on both its magnitude and its corresponding intermediate activation norms in the Multi-Layer Perceptron (MLP) sub-layers. DaSS retains the computational efficiency of the Wanda pruning method while outperforming both SparseGPT and Wanda under hardware-friendly N:M sparsity patterns, in which at most N weights in every group of M consecutive weights are nonzero.
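The following NumPy sketch shows one way a score of this kind could drive a 2:4 mask. It is not the authors' implementation: the score form `|W| * norm ** alpha`, the exponent `alpha=0.5`, the function names, and the choice to form groups along the intermediate dimension are all assumptions made for illustration. The grouping choice matters here because a per-neuron scale factor cannot change the ranking inside a group that lies entirely within one neuron's row.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def nm_mask(scores, n=2, m=4):
    # Keep the n highest-scoring entries in every group of m consecutive columns.
    rows, cols = scores.shape
    assert cols % m == 0, "number of columns must be divisible by m"
    groups = scores.reshape(rows, cols // m, m)
    top = np.argsort(groups, axis=-1)[..., -n:]   # indices of the n largest per group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(rows, cols)

def intermediate_norms(x, w_gate, w_up):
    # L2 norm of each intermediate (post-gating) activation over calibration tokens.
    hidden = silu(x @ w_gate.T) * (x @ w_up.T)    # (tokens, d_int)
    return np.linalg.norm(hidden, axis=0)         # (d_int,)

def dependency_aware_mask(w, h_norm, alpha=0.5, n=2, m=4):
    # w: (d_int, d_model) gate or up projection; row i feeds intermediate neuron i.
    # Hypothetical score: weight magnitude scaled by the neuron's activation norm.
    score = np.abs(w) * (h_norm[:, None] ** alpha)
    # Assumption: groups of m span consecutive intermediate neurons for a fixed input.
    return nm_mask(score.T, n, m).T

# Toy calibration data and weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_int, tokens = 8, 16, 32
x = rng.normal(size=(tokens, d_model))
w_gate = rng.normal(size=(d_int, d_model))
w_up = rng.normal(size=(d_int, d_model))

h_norm = intermediate_norms(x, w_gate, w_up)
mask_up = dependency_aware_mask(w_up, h_norm)
mask_gate = dependency_aware_mask(w_gate, h_norm)
w_up_sparse = w_up * mask_up
w_gate_sparse = w_gate * mask_gate

print(mask_up.mean(), mask_gate.mean())  # fraction of weights kept: 0.5 0.5
```

Like Wanda, this kind of scoring needs only a forward pass over a small calibration set and no weight updates, which is why it stays cheap compared with reconstruction-based methods such as SparseGPT.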

The researchers evaluate DaSS on the Mistral, Gemma, and LLaMA2 model families, where it outperforms existing pruning methods under the targeted sparsity patterns while remaining as computationally cheap as Wanda.

Critical Analysis

The paper provides a thorough evaluation of the DaSS pruning method, exploring its effectiveness across different model families. However, the authors acknowledge that their approach may not generalize to all types of language models, as the structural dependencies can vary. Further research is needed to understand how DaSS performs on a wider range of architectures and applications.

Additionally, the paper does not delve into the potential implications of deploying heavily pruned models in real-world scenarios. While DaSS achieves compelling hardware efficiency, the authors do not address potential trade-offs in terms of model robustness, safety, or fairness. Exploring these aspects would be an important area for future work.

Overall, the DaSS pruning technique represents a promising step forward in making large language models more accessible and deployable on a variety of hardware platforms. However, continued research and careful consideration of the broader implications are necessary to ensure these advancements benefit society as a whole.

Conclusion

The paper presents a novel pruning method called Dependency-aware Semi-structured Sparsity (DaSS) that addresses the hardware challenges associated with large language models. By incorporating structural dependency information into the pruning process, DaSS is able to achieve hardware-friendly sparsity patterns while maintaining computational efficiency.

The researchers' evaluations demonstrate the effectiveness of DaSS, showing that it outperforms existing pruning techniques at the same sparsity level. This work represents an important step forward in making large language models more practical to deploy on real-world hardware, potentially paving the way for wider adoption and more accessible AI-powered applications.

As the field of language modeling continues to evolve, the principles and insights from the DaSS approach may inspire further innovations in model compression and efficiency optimization, ultimately helping to unlock the full potential of these powerful AI systems.

Related Papers

Dependency-Aware Semi-Structured Sparsity of GLU Variants in Large Language Models

Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe

The rapid growth in the scale of Large Language Models (LLMs) has led to significant computational and memory costs, making model compression techniques such as network pruning increasingly crucial for their efficient deployment. Recent LLMs such as LLaMA2 and Mistral have adopted GLU-based MLP architectures. However, current LLM pruning strategies are primarily based on insights from older LLM architectures, necessitating a reevaluation of these strategies to suit the new architectural characteristics. Contrary to traditional beliefs, we find that outliers play a diminished role in the input projections of GLU-based MLPs. Leveraging this new insight, we propose Dependency-aware Semi-structured Sparsity (DaSS), a novel pruning method for GLU-based LLMs. DaSS balances the flexibility of unstructured pruning and the structural consistency of dependency-based structured pruning by considering both weight magnitude and the corresponding intermediate activation norms in the weight pruning metric. Empirical evaluations on the Mistral, Gemma, and LLaMA2 model families demonstrate the consistent effectiveness of DaSS in the prevailing GLU variants.

6/21/2024

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performance in a wide range of text generation tasks. However, the enormous model sizes have hindered their practical use in real-world applications due to high inference latency. Therefore, improving the efficiency of LLMs through quantization, pruning, and other means has been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need for any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method become even more apparent when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the available code.

4/24/2024

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, Zhiqiang Shen

Large Language Models (LLMs) with billions of parameters are prime targets for network pruning, removing some model weights without hurting performance. Prior approaches such as magnitude pruning, SparseGPT, and Wanda, either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained LLMs. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the pruning metric, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguingly, by incorporating gradients, unstructured pruning with our method tends to reveal some structural patterns, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates to maintain its simplicity as other counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various benchmarks show that GBLM-Pruner surpasses magnitude pruning, Wanda and SparseGPT by significant margins. We further extend our approach on Vision Transformer. Our code and models are available at https://github.com/VILA-Lab/GBLM-Pruner.

4/10/2024

Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism

Guanchen Li, Xiandong Zhao, Lian Liu, Zeping Li, Dong Li, Lu Tian, Jie He, Ashish Sirasao, Emad Barsoum

Pre-trained language models (PLMs) are engineered to be robust in contextual understanding and exhibit outstanding performance in various natural language processing tasks. However, their considerable size incurs significant computational and storage costs. Modern pruning strategies employ one-shot techniques to compress PLMs without the need for retraining on task-specific or otherwise general data; however, these approaches often lead to an indispensable reduction in performance. In this paper, we propose SDS, a Sparse-Dense-Sparse pruning framework to enhance the performance of the pruned PLMs from a weight distribution optimization perspective. We outline the pruning process in three steps. Initially, we prune less critical connections in the model using conventional one-shot pruning methods. Next, we reconstruct a dense model featuring a pruning-friendly weight distribution by reactivating pruned connections with sparse regularization. Finally, we perform a second pruning round, yielding a superior pruned model compared to the initial pruning. Experimental results demonstrate that SDS outperforms the state-of-the-art pruning techniques SparseGPT and Wanda under an identical sparsity configuration. For instance, SDS reduces perplexity by 9.13 on Raw-Wikitext2 and improves accuracy by an average of 2.05% across multiple zero-shot benchmarks for OPT-125M with 2:4 sparsity.

8/21/2024