VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning

2406.05276

Published 6/13/2024 by Oshin Dutta, Ritvik Gupta, Sumeet Agarwal

Abstract

In recent years, there has been a growing emphasis on compressing large pre-trained transformer models for resource-constrained devices. However, traditional pruning methods often leave the embedding layer untouched, leading to model over-parameterization. Additionally, they require extensive compression time with large datasets to maintain performance in pruned models. To address these challenges, we propose VTrans, an iterative pruning framework guided by the Variational Information Bottleneck (VIB) principle. Our method compresses all structural components, including embeddings, attention heads, and layers using VIB-trained masks. This approach retains only essential weights in each layer, ensuring compliance with specified model size or computational constraints. Notably, our method achieves up to 70% more compression than prior state-of-the-art approaches, both task-agnostic and task-specific. We further propose faster variants of our method: Fast-VTrans utilizing only 3% of the data and Faster-VTrans, a time-efficient alternative that involves exclusive finetuning of VIB masks, accelerating compression by up to 25 times with minimal performance loss compared to previous methods. Extensive experiments on BERT, RoBERTa, and GPT-2 models substantiate the efficacy of our method. Moreover, our method demonstrates scalability in compressing large models such as LLaMA-2-7B, achieving superior performance compared to previous pruning methods. Additionally, we use attention-based probing to qualitatively assess model redundancy and interpret the efficiency of our approach. Notably, our method considers heads with high attention to special and current tokens in the unpruned model as foremost candidates for pruning, while retained heads are observed to attend more to task-critical keywords.

Overview

This paper introduces VTrans, a method for accelerating the compression of Transformer models using a Variational Information Bottleneck (VIB) based pruning approach.
Transformer models have become widely used in various AI applications, but their large size and high computational requirements have hindered their deployment on resource-constrained devices.
VTrans aims to address this challenge by selectively pruning less important weights in the Transformer model, reducing its size and inference time while maintaining high performance.

Plain English Explanation

The paper discusses a way to make Transformer models, which are a type of artificial intelligence (AI) model, smaller and faster to use. Transformer models have become very popular in many AI applications, but they are quite large and require a lot of computing power to run. This makes it hard to use them on devices with limited resources, like smartphones or small computers.

The VTrans method tries to solve this problem by selectively removing parts of the Transformer model that aren't as important. This process, called pruning, can reduce the size of the model and make it run faster, without significantly impacting its performance. The key innovation in VTrans is the use of a technique called Variational Information Bottleneck (VIB), which helps identify the most important parts of the model to keep and the less important parts to remove.

By making Transformer models smaller and faster, VTrans could enable their use in a wider range of applications and on a broader set of devices, including those with limited computing power and memory. This could lead to more efficient and accessible AI-powered technologies.

Technical Explanation

VTrans is an iterative pruning framework guided by the Variational Information Bottleneck (VIB) principle. Unlike traditional pruning methods that often leave the embedding layer untouched, it compresses all structural components of the Transformer, including embeddings, attention heads, and layers.

The key innovation is the use of VIB-trained masks, which identify the weights each layer needs to keep and mark the rest for removal, so that the pruned model complies with a specified model size or computational constraint. The paper also proposes two faster variants: Fast-VTrans, which uses only 3% of the data, and Faster-VTrans, which fine-tunes only the VIB masks and is reported to accelerate compression by up to 25 times with minimal performance loss.
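To make the idea of a VIB-style mask more concrete, here is a minimal sketch in plain Python. It assumes one common formulation from the VIB pruning literature: each prunable unit gets a multiplicative Gaussian gate z = mu + sigma * eps, a penalty of the form log(1 + mu^2 / sigma^2) pushes uninformative gates toward noise, and a unit is kept only if its ratio mu^2 / sigma^2 clears a threshold. The class name `VIBMask`, its parameters, and the threshold value are hypothetical, and the exact mask and loss used by VTrans may differ.

```python
import math
import random


class VIBMask:
    """Hypothetical per-unit Gaussian gate: z_i = mu_i + sigma_i * eps_i."""

    def __init__(self, mu, log_sigma):
        # In real training these would be learnable parameters; here they are fixed lists.
        self.mu = list(mu)
        self.log_sigma = list(log_sigma)

    def sample(self, rng):
        # Reparameterized sample of the gate values used to scale each unit's output.
        return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
                for m, ls in zip(self.mu, self.log_sigma)]

    def alpha(self):
        # Signal-to-noise ratio mu^2 / sigma^2 for each unit.
        return [m * m / math.exp(2.0 * ls)
                for m, ls in zip(self.mu, self.log_sigma)]

    def kl_penalty(self):
        # Assumed regularizer: sum of log(1 + mu^2 / sigma^2), constants omitted.
        return sum(math.log1p(a) for a in self.alpha())

    def keep(self, threshold=1.0):
        # Assumed pruning rule: keep a unit only if its ratio reaches the threshold.
        return [a >= threshold for a in self.alpha()]


# Three hypothetical units (for example, three attention heads), all with sigma = 1.
mask = VIBMask(mu=[2.0, 0.1, 1.0], log_sigma=[0.0, 0.0, 0.0])
gates = mask.sample(random.Random(0))
keep_flags = mask.keep(threshold=1.0)
print(keep_flags)  # [True, False, True]
```

In this sketch the second unit has a mean close to zero relative to its noise, so it carries little information and is the one marked for pruning.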

The authors report experiments on BERT, RoBERTa, and GPT-2, and show that the method scales to a larger model, LLaMA-2-7B. According to the abstract, VTrans achieves up to 70% more compression than prior state-of-the-art approaches, both task-agnostic and task-specific, and outperforms previous pruning methods on LLaMA-2-7B. The authors also use attention-based probing to examine redundancy: heads that attend heavily to special and current tokens in the unpruned model are the first candidates for pruning, while retained heads attend more to task-critical keywords.
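As a rough illustration of what pruning to a fixed parameter budget can look like, the sketch below greedily keeps the highest-scoring units whose combined parameter count still fits the budget. This is not the selection procedure from the paper: the function `select_under_budget`, the unit names, the importance scores, and the parameter counts are all hypothetical and chosen only for illustration.

```python
def select_under_budget(units, budget):
    """Greedy sketch: keep high-importance units while the parameter total fits the budget.

    units  -- list of (name, importance_score, n_params) tuples (hypothetical values)
    budget -- maximum number of parameters allowed in the pruned model
    """
    kept, used = [], 0
    # Visit units from most to least important.
    for name, score, n_params in sorted(units, key=lambda u: u[1], reverse=True):
        if used + n_params <= budget:
            kept.append(name)
            used += n_params
    return kept, used


# Hypothetical units spanning heads, a feed-forward block, and an embedding dimension.
units = [
    ("head_0", 4.0, 100),
    ("head_1", 0.5, 100),
    ("ffn_0", 2.0, 300),
    ("emb_dim_0", 1.0, 50),
]
kept, used = select_under_budget(units, budget=450)
print(kept, used)  # ['head_0', 'ffn_0', 'emb_dim_0'] 450
```

Here the importance score stands in for whatever signal a trained mask provides. The least important head is dropped because adding it would exceed the budget.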

Critical Analysis

The paper presents a promising approach for accelerating the compression of Transformer models, which could enable their broader deployment on resource-constrained devices. The use of VIB-based pruning is a novel and well-designed technique that appears to be effective in identifying and removing less important weights in the model.

However, some limitations and open questions remain. For example, although the abstract reports that Faster-VTrans accelerates compression by up to 25 times, it does not state the absolute computational cost of training the VIB masks, which could be a concern when compression must be repeated often or run on limited hardware. Additionally, the paper focuses on static pruning, and it would be interesting to explore dynamic pruning approaches that could further optimize the model's performance during inference.

Furthermore, the authors could have provided more detailed analysis on the types of tasks and model architectures where VTrans performs best, as well as the potential trade-offs between compression ratio and model accuracy. Exploring the generalization of VTrans to other types of neural networks beyond Transformers could also be a fruitful area for future research.

Overall, the VTrans approach is a significant contribution to the field of model compression and could have a positive impact on the deployment of Transformer-based AI systems on resource-constrained devices. However, further research and analysis would be beneficial to fully understand the capabilities and limitations of the proposed method.

Conclusion

The VTrans method presented in this paper offers a promising approach for accelerating the compression of Transformer models, which could enable their broader deployment on a wide range of devices, including those with limited computing resources. By leveraging a Variational Information Bottleneck-based pruning technique, VTrans can significantly reduce the size and inference time of Transformer models while maintaining high performance.

The experimental results on BERT, RoBERTa, GPT-2, and LLaMA-2-7B demonstrate the effectiveness of VTrans, which is reported to outperform prior state-of-the-art pruning approaches. This innovation could have important implications for the development of more efficient and accessible AI-powered technologies, as it addresses a key challenge in the deployment of large, computationally intensive Transformer models.

While the paper presents a robust and promising approach, further research is needed to fully understand the capabilities and limitations of VTrans, such as the computational overhead of the pruning process, the potential for dynamic pruning, and the generalization of the method to other types of neural networks. Nonetheless, the VTrans method represents a significant contribution to the field of model compression and is an important step towards more efficient and accessible Transformer-based AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Comprehensive Survey of Model Compression and Speed up for Vision Transformers

Feiyang Chen, Ziqian Luo, Lisang Zhou, Xueting Pan, Ying Jiang

Vision Transformers (ViT) have marked a paradigm shift in computer vision, outperforming state-of-the-art models across diverse tasks. However, their practical deployment is hampered by high computational and memory demands. This study addresses the challenge by evaluating four primary model compression techniques: quantization, low-rank approximation, knowledge distillation, and pruning. We methodically analyze and compare the efficacy of these techniques and their combinations in optimizing ViTs for resource-constrained environments. Our comprehensive experimental evaluation demonstrates that these methods facilitate a balanced compromise between model accuracy and computational efficiency, paving the way for wider application in edge computing devices.

4/17/2024

cs.CV

Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Yang He, Joey Tianyi Zhou

Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the local attention weights are compared at a global level, which may cause some locally important weights to be pruned due to their relatively small magnitude globally. The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels. To solve the aforementioned issues, we have developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that local attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model. Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning

4/23/2024

cs.CV cs.LG

A Survey on Transformer Compression

Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao

Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV), especially for constructing large language models (LLM) and large vision models (LVM). Model compression methods reduce the memory and computational cost of Transformer, which is a necessary step to implement large language/vision models on practical devices. Given the unique architecture of Transformer, featuring alternative attention and feedforward neural network (FFN) modules, specific compression techniques are usually required. The efficiency of these compression methods is also paramount, as retraining large models on the entire training dataset is usually impractical. This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models. The compression methods are primarily categorized into pruning, quantization, knowledge distillation, and efficient architecture design (Mamba, RetNet, RWKV, etc.). In each category, we discuss compression methods for both language and vision tasks, highlighting common underlying principles. Finally, we delve into the relation between various compression methods, and discuss further directions in this domain.

4/9/2024

cs.LG cs.CL cs.CV

Block Selective Reprogramming for On-device Training of Vision Transformers

Sreetama Sarkar, Souvik Kundu, Kai Zheng, Peter A. Beerel

The ubiquity of vision transformers (ViTs) for various edge applications, including personalized learning, has created the demand for on-device fine-tuning. However, training with the limited memory and computation power of edge devices remains a significant challenge. In particular, the memory required for training is much higher than that needed for inference, primarily due to the need to store activations across all layers in order to compute the gradients needed for weight updates. Previous works have explored reducing this memory requirement via frozen-weight training as well as storing the activations in a compressed format. However, these methods are deemed inefficient due to their inability to provide training or inference speedup. In this paper, we first investigate the limitations of existing on-device training methods aimed at reducing memory and compute requirements. We then present block selective reprogramming (BSR) in which we fine-tune only a fraction of total blocks of a pre-trained model and selectively drop tokens based on self-attention scores of the frozen layers. To show the efficacy of BSR, we present extensive evaluations on ViT-B and DeiT-S with five different datasets. Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x while maintaining similar accuracy. We also showcase results for Mixture-of-Expert (MoE) models, demonstrating the effectiveness of our approach in multitask learning scenarios.

5/21/2024

cs.CV cs.LG