ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer

Read original: arXiv:2306.06446 - Published 6/12/2024 by Haoran You (Celine), Huihong Shi (Celine), Yipin Guo (Celine), Yingyan (Celine), Lin

👀

Overview

Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for multiple vision tasks.
However, both the attention mechanism and multi-layer perceptrons (MLPs) in ViTs are not sufficiently efficient due to dense multiplications, leading to costly training and inference.
The authors propose a new model called ShiftAddViT that reparameterizes pre-trained ViTs with a mixture of multiplication primitives, such as bitwise shifts and additions, to achieve end-to-end inference speedups on GPUs without requiring training from scratch.

Plain English Explanation

Vision Transformers (ViTs) are a type of deep learning model that have shown great success in a variety of visual tasks. However, the key components of ViTs, the attention mechanism and multi-layer perceptrons (MLPs), are computationally expensive due to the large number of multiplications involved. This makes ViTs costly to train and run efficiently on hardware like GPUs.

The researchers propose a new model called ShiftAddViT that aims to address this issue. Instead of using standard multiplications, ShiftAddViT reparameterizes the attention mechanism to use simpler operations like bitwise shifts and additions. This reduces the overall computational burden without significantly impacting the model's accuracy.

For the remaining MLP layers, the researchers develop a new "mixture of experts" framework that dynamically assigns different parts of the input to specialized sub-modules that use either multiplication or shift-based operations, depending on which is more efficient. This helps maintain the model's performance while further accelerating inference.

The key idea is to transform the original ViT model into a more efficient version, ShiftAddViT, that can run much faster on GPUs without having to retrain the entire model from scratch. This could lead to significant improvements in the real-world deployment of ViTs.

Technical Explanation

The authors propose a new model, dubbed ShiftAddViT, that reparameterizes pre-trained Vision Transformers (ViTs) to replace the computationally expensive matrix multiplications (MatMuls) with more efficient operations like bitwise shifts and additions.

Specifically, the attention mechanism in ViTs is reparameterized by mapping the queries and keys to binary codes in Hamming space, allowing the MatMuls among queries, keys, and values to be replaced with additive kernels. The remaining MLP or linear layers are then reparameterized with shift kernels.

To handle the accuracy drops caused by the MLP reparameterization, the authors propose a new "mixture of experts" (MoE) framework. This framework dynamically assigns input tokens to different expert sub-modules that use either multiplication or shift-based operations, depending on their latency. A new latency-aware load-balancing loss function is introduced to train the routing mechanism.

The authors utilize the TVM framework to implement and optimize these customized kernels for practical hardware deployment on GPUs. Extensive experiments on various 2D and 3D Transformer-based vision tasks demonstrate that ShiftAddViT can achieve up to 5.18x latency reductions on GPUs and 42.9% energy savings, while maintaining comparable accuracy to the original or efficient ViT models.

Critical Analysis

The paper presents a clever approach to reparameterizing ViT models to reduce the computational burden of the attention mechanism and MLPs, which are the main bottlenecks in these models. The proposed ShiftAddViT model and the mixture of experts framework for handling MLP reparameterization are both interesting and well-designed solutions.

However, the authors acknowledge that the MLP reparameterization does lead to some accuracy drops, which they try to mitigate with the MoE approach. It would be valuable to see more analysis on the trade-offs between the achieved inference speedups and the impact on model performance across different tasks and datasets.

Additionally, the paper does not provide much insight into the generalizability of the proposed techniques. It would be helpful to understand how well the ShiftAddViT approach might work for other types of Transformer-based models beyond just ViTs, or if there are any limitations in the types of vision tasks it can be effectively applied to.

Overall, the paper presents a promising direction for improving the efficiency of ViT models, and the ShiftAddViT and MoE techniques could be useful contributions to the ongoing efforts in accelerating Transformer-based models and [developing more efficient vision Transformers.

Conclusion

The paper proposes a new model called ShiftAddViT that reparameterizes pre-trained Vision Transformers (ViTs) to replace computationally expensive matrix multiplications with more efficient bitwise shifts and additions. This approach, combined with a novel mixture of experts framework for handling the MLP layers, can achieve significant inference speedups on GPUs without requiring full retraining of the model.

The key contributions of this work are the development of the ShiftAddViT model and the latency-aware load-balancing loss function, which together enable efficient hardware deployment of Transformer-based vision models. These techniques could have important implications for the real-world application of ViTs, making them more practical and accessible for a wide range of use cases.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

👀

ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer

Haoran You (Celine), Huihong Shi (Celine), Yipin Guo (Celine), Yingyan (Celine), Lin

Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for multiple vision tasks. However, both the attention mechanism and multi-layer perceptrons (MLPs) in ViTs are not sufficiently efficient due to dense multiplications, leading to costly training and inference. To this end, we propose to reparameterize pre-trained ViTs with a mixture of multiplication primitives, e.g., bitwise shifts and additions, towards a new type of multiplication-reduced model, dubbed $textbf{ShiftAddViT}$, which aims to achieve end-to-end inference speedups on GPUs without requiring training from scratch. Specifically, all $texttt{MatMuls}$ among queries, keys, and values are reparameterized using additive kernels, after mapping queries and keys to binary codes in Hamming space. The remaining MLPs or linear layers are then reparameterized with shift kernels. We utilize TVM to implement and optimize those customized kernels for practical hardware deployment on GPUs. We find that such a reparameterization on attention maintains model accuracy, while inevitably leading to accuracy drops when being applied to MLPs. To marry the best of both worlds, we further propose a new mixture of experts (MoE) framework to reparameterize MLPs by taking multiplication or its primitives as experts, e.g., multiplication and shift, and designing a new latency-aware load-balancing loss. Such a loss helps to train a generic router for assigning a dynamic amount of input tokens to different experts according to their latency. Extensive experiments on various 2D/3D Transformer-based vision tasks consistently validate the effectiveness of our proposed ShiftAddViT, achieving up to $textbf{5.18$times$}$ latency reductions on GPUs and $textbf{42.9}$% energy savings, while maintaining a comparable accuracy as original or efficient ViTs.

6/12/2024

ShiftAddAug: Augment Multiplication-Free Tiny Neural Network with Hybrid Computation

Yipin Guo, Zihao Li, Yilin Lang, Qinyuan Ren

Operators devoid of multiplication, such as Shift and Add, have gained prominence for their compatibility with hardware. However, neural networks (NNs) employing these operators typically exhibit lower accuracy compared to conventional NNs with identical structures. ShiftAddAug uses costly multiplication to augment efficient but less powerful multiplication-free operators, improving performance without any inference overhead. It puts a ShiftAdd tiny NN into a large multiplicative model and encourages it to be trained as a sub-model to obtain additional supervision. In order to solve the weight discrepancy problem between hybrid operators, a new weight sharing method is proposed. Additionally, a novel two stage neural architecture search is used to obtain better augmentation effects for smaller but stronger multiplication-free tiny neural networks. The superiority of ShiftAddAug is validated through experiments in image classification and semantic segmentation, consistently delivering noteworthy enhancements. Remarkably, it secures up to a 4.95% increase in accuracy on the CIFAR100 compared to its directly trained counterparts, even surpassing the performance of multiplicative NNs.

7/4/2024

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Celine Lin

Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.

7/26/2024

Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer

Shuai Peng, Di Fu, Baole Wei, Yong Cao, Liangcai Gao, Zhi Tang

Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote&Mix (textbf{VoMix}), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models textit{without any training}. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2$times$ increase in throughput of existing ViT-H on ImageNet-1K and a 2.4$times$ increase in throughput of existing ViT-L on Kinetics-400 video dataset, with a mere 0.3% drop in top-1 accuracy.

9/2/2024