Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

Read original: arXiv:2211.10526 - Published 7/26/2024 by Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, Yingyan Celine Lin

Overview

Vision Transformers (ViTs) have shown impressive performance but incur a higher computation cost than Convolutional Neural Networks (CNNs).
A key reason is that ViTs' attention mechanism measures global similarities between all pairs of tokens, so its complexity is quadratic in the number of input tokens.
Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer), which sacrifice ViTs' ability to capture either global or local context.
This paper proposes a framework called Castling-ViT that aims to enable ViTs to learn both global and local context while being more efficient during inference.

Plain English Explanation

The paper focuses on Vision Transformers (ViTs), a type of AI model that has shown impressive performance but requires a lot of computing power. One reason for this is that ViTs use an attention mechanism that compares each input element to all other elements, which is computationally expensive.
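
To make the cost gap concrete, the sketch below counts multiply-accumulate operations (MACs) for one attention head in two ways: the usual order, which builds an N x N similarity matrix, and a generic linear-attention order, which only builds a d x d intermediate. The function names, the head dimension of 64, and the token counts are illustrative assumptions rather than figures from the paper, and the count ignores the softmax, projections, and normalization.

```python
def softmax_attention_macs(num_tokens: int, head_dim: int) -> int:
    """MACs for Q @ K^T (N*N*d) plus attn @ V (N*N*d) in one head."""
    return 2 * num_tokens * num_tokens * head_dim


def linear_attention_macs(num_tokens: int, head_dim: int) -> int:
    """MACs for K^T @ V (N*d*d) plus Q @ (K^T V) (N*d*d) in one head."""
    return 2 * num_tokens * head_dim * head_dim


if __name__ == "__main__":
    d = 64  # assumed head dimension, for illustration only
    for n in (196, 784, 3136):  # assumed token counts (e.g. 14x14, 28x28, 56x56 patches)
        quad = softmax_attention_macs(n, d)
        lin = linear_attention_macs(n, d)
        print(f"N={n:5d}  quadratic={quad:>13,}  linear={lin:>12,}  ratio={quad / lin:.1f}x")
```

Under this simplified count the ratio between the two is N/d, so the advantage of the linear order grows with the number of tokens, which is why it matters most for high-resolution inputs.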

Some existing ViTs try to be more efficient by only looking at nearby elements (local attention) or by using a simpler attention calculation (linear attention). However, each shortcut gives something up: local attention loses global context, while linear attention tends to weaken the model's grasp of local detail.

The researchers behind this paper ask an important question: can ViTs learn both global and local context while also being more efficient? To address this, they propose a new framework called Castling-ViT.

The key ideas in Castling-ViT are:

- Training with both an efficient linear-angular attention and a masked softmax-based quadratic attention, then switching to linear-angular attention alone at inference.
- Simplifying the angular similarity measure by decomposing it into a linear part and a high-order residual part, and only keeping the linear part.
- Using two additional modules, a depthwise convolution and an auxiliary masked softmax attention, to approximate the discarded residual and help the model learn both global and local information. The masks of the auxiliary attention are pushed toward zero during training, so it adds no overhead at final inference.
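
Why keeping only a linear similarity term helps can be seen from the order of the matrix products. The following is a minimal sketch of that generic linear-attention trick, not the paper's implementation: with a purely linear similarity, (Q K^T) V and Q (K^T V) give the same result, but the second form never builds the N x N matrix. The helper names are hypothetical, and normalization is left out to keep the comparison exact.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product: rows of a against columns of b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def quadratic_order(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """(Q K^T) V: materializes an N x N similarity matrix first."""
    return matmul(matmul(q, transpose(k)), v)


def linear_order(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Q (K^T V): only a d x d intermediate, so cost grows linearly in N."""
    return matmul(q, matmul(transpose(k), v))


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Random test matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 8, 4  # toy sizes chosen for illustration
    q, k, v = (random_matrix(n, d, rng) for _ in range(3))
    a = quadratic_order(q, k, v)
    b = linear_order(q, k, v)
    max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print(f"max difference between the two orders: {max_diff:.2e}")
```

A softmax sits between the two products in standard attention, which is what prevents this reordering there; dropping the high-order terms removes that obstacle.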

By using this approach, the researchers were able to achieve higher accuracy or significant reductions in computational cost compared to standard ViTs, while still maintaining the model's ability to capture both global and local information.

Technical Explanation

The paper proposes a framework called Castling-ViT that aims to enable Vision Transformers (ViTs) to learn both global and local context while being more efficient during inference.

The key components of Castling-ViT are:

Linear-Angular Attention: Castling-ViT trains with both linear-angular attention and masked softmax-based quadratic attention. The similarity between queries and keys is measured with angular kernels, i.e., via the spectral angle between them. The linear attention is obtained by decomposing these angular kernels into linear terms and high-order residuals, and only keeping the linear terms, which makes the attention cost linear rather than quadratic in the number of tokens.
Auxiliary Modules: Castling-ViT uses two additional modules to help the model learn both global and local information:
- A depthwise convolution module to capture local context.
- An auxiliary masked softmax attention module. The masks for this attention are regularized to gradually become zeros, so this module doesn't add overhead during final inference.
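
The decomposition into linear terms and high-order residuals can be illustrated with a short derivation. It assumes unit-normalized queries and keys and an angular kernel of the form 1 - θ/π, where θ is the angle between them; that exact form is an assumption made here for illustration, since this summary only states that similarities are measured via spectral angles. The expansion itself is the standard Taylor series of arcsin.

```latex
\begin{align*}
\theta &= \arccos\!\left(q^{\top} k\right), \qquad \|q\| = \|k\| = 1, \\
\mathrm{Sim}(q, k) &= 1 - \frac{\theta}{\pi}
  = \frac{1}{2} + \frac{1}{\pi}\,\arcsin\!\left(q^{\top} k\right) \\
 &= \underbrace{\frac{1}{2} + \frac{1}{\pi}\, q^{\top} k}_{\text{linear terms (kept)}}
  \;+\; \underbrace{\frac{1}{\pi}\left(\frac{(q^{\top} k)^{3}}{6}
  + \frac{3\,(q^{\top} k)^{5}}{40} + \cdots\right)}_{\text{high-order residual}}
\end{align*}
```

The kept part is affine in the dot product, so it can be applied to all tokens without forming the full similarity matrix; the residual is what the two auxiliary modules are meant to approximate.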

During inference, Castling-ViT only uses the linear-angular attention, which is more efficient than the full quadratic attention used during training.
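
The switch between training and inference can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the similarity is taken to be 1/2 + (q·k)/π on unit-normalized vectors, the learned mask is reduced to a single scalar gate `mask_scale`, and the depthwise convolution and output normalization are omitted. The function name `castling_style_attention` is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def unit_rows(m: Matrix) -> Matrix:
    """Scale each row to unit L2 norm so dot products are cosines."""
    out = []
    for row in m:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        out.append([x / norm for x in row])
    return out


def linear_angular(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Compute sum_j (1/2 + q_i.k_j / pi) v_j without an N x N matrix."""
    q, k = unit_rows(q), unit_rows(k)
    d_k, d_v = len(k[0]), len(v[0])
    v_sum = [sum(row[c] for row in v) for c in range(d_v)]  # sum_j v_j
    ktv = [[sum(k[j][a] * v[j][c] for j in range(len(k))) for c in range(d_v)]
           for a in range(d_k)]  # K^T V, a d x d intermediate
    return [[0.5 * v_sum[c] + sum(qi[a] * ktv[a][c] for a in range(d_k)) / math.pi
             for c in range(d_v)] for qi in q]


def softmax_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Standard quadratic attention with scaled dot-product scores."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def castling_style_attention(q: Matrix, k: Matrix, v: Matrix,
                             mask_scale: float, training: bool) -> Matrix:
    """Linear-angular branch plus an auxiliary branch gated by mask_scale."""
    out = linear_angular(q, k, v)
    if training and mask_scale != 0.0:
        # Auxiliary quadratic branch; skipped entirely once the gate is zero.
        aux = softmax_attention(q, k, v)
        out = [[o + mask_scale * a for o, a in zip(ro, ra)]
               for ro, ra in zip(out, aux)]
    return out
```

Once the gate has been regularized to zero, the training-time output already equals the inference-time output, so removing the quadratic branch changes nothing but the cost.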

The paper reports experiments and ablation studies on three tasks. Compared to ViTs with vanilla softmax-based attention, Castling-ViT achieves up to 1.8% higher accuracy or a 40% reduction in multiply-accumulate operations (MACs) on ImageNet classification, and 1.2 higher mAP on COCO object detection under comparable FLOPs.

Critical Analysis

The paper presents a compelling approach to improving the efficiency of Vision Transformers while preserving their ability to capture both global and local context. The key ideas, such as decomposing the attention mechanism and using auxiliary modules, are technically sound and well-motivated.

One potential limitation is that the paper does not provide a deep analysis of the learned attention patterns or masks. It would be interesting to understand how the model is balancing global and local information, and whether the learned attention mechanisms align with human intuitions about relevant features.

Additionally, the results highlighted in the abstract are for image classification and object detection, although it mentions experiments on three tasks. It would be valuable to evaluate the Castling-ViT framework on a broader range of computer vision problems, such as video understanding, to better understand its general applicability.

Overall, the Castling-ViT framework represents a promising direction for making Vision Transformers more practical and widely usable, and the paper provides a solid technical foundation for further research in this area.

Conclusion

This paper introduces a new framework called Castling-ViT that aims to enable Vision Transformers to learn both global and local context while being more efficient during inference. The key ideas are to train with both linear-angular attention and masked softmax-based quadratic attention, to add auxiliary modules that help capture global and local information, and to switch to linear-angular attention alone at inference so that the quadratic branch adds no overhead.

The results show that Castling-ViT can achieve significant improvements in accuracy or computational cost compared to standard ViTs, making it a compelling approach for deploying ViTs in real-world applications. The framework represents an important step forward in making Vision Transformers more practical and widely usable, and the techniques introduced in this paper could have broader implications for efficient neural network design.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, Yingyan Celine Lin

Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost as compared to convolutional neural networks (CNNs), one reason is that ViTs' attention measures global similarities and thus has a quadratic complexity with the number of input tokens. Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer), which sacrifice ViTs' capabilities of capturing either global or local context. In this work, we ask an important research question: Can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear angular attention during ViT inference. Our Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles. And we further simplify it with two techniques: (1) a novel linear-angular attention mechanism: we decompose the angular kernels into linear terms and high-order residuals, and only keep the linear terms; and (2) we adopt two parameterized modules to approximate high-order residuals: a depthwise convolution and an auxiliary masked softmax attention to help learn both global and local information, where the masks for softmax attention are regularized to gradually become zeros and thus incur no overhead during ViT inference. Extensive experiments and ablation studies on three tasks consistently validate the effectiveness of the proposed Castling-ViT, e.g., achieving up to a 1.8% higher accuracy or 40% MACs reduction on ImageNet classification and 1.2 higher mAP on COCO detection under comparable FLOPs, as compared to ViTs with vanilla softmax-based attentions.

7/26/2024

You Only Need Less Attention at Each Stage in Vision Transformers

Shuoxi Zhang, Hanpeng Liu, Stephen Lin, Kun He

The advent of Vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computer vision. ViTs capture the global information of images through self-attention modules, which perform dot product computations among patchified image tokens. While self-attention modules empower ViTs to capture long-range dependencies, the computational complexity grows quadratically with the number of tokens, which is a major hindrance to the practical application of ViTs. Moreover, the self-attention mechanism in deep ViTs is also susceptible to the attention saturation issue. Accordingly, we argue against the necessity of computing the attention scores in every layer, and we propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage and calculates the subsequent feature alignments in other layers via attention transformations that leverage the previously calculated attention scores. This novel approach can mitigate two primary issues plaguing traditional self-attention modules: the heavy computational burden and attention saturation. Our proposed architecture offers superior efficiency and ease of implementation, merely requiring matrix multiplications that are highly optimized in contemporary deep learning frameworks. Moreover, our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.

6/4/2024

CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications

Tianfang Zhang, Lei Li, Yang Zhou, Wentao Liu, Chen Qian, Xiangyang Ji

Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability. However, the pairwise token affinity and complex matrix operations limit its deployment on resource-constrained scenarios and real-time applications, such as mobile devices, although considerable efforts have been made in previous works. In this paper, we introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Firstly, we argue that the capability of token mixers to obtain global contextual information hinges on multiple information interactions, such as spatial and channel domains. Subsequently, we construct a novel additive similarity function following this paradigm and present an efficient implementation named Convolutional Additive Token Mixer (CATM). This simplification leads to a significant reduction in computational overhead. We evaluate CAS-ViT across a variety of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. Our experiments, conducted on GPUs, ONNX, and iPhones, demonstrate that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones, establishing it as a viable option for efficient mobile vision applications. Our code and model are available at: https://github.com/Tianfang-Zhang/CAS-ViT

8/9/2024

FasterViT: Fast Vision Transformers with Hierarchical Attention

Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov

We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications. FasterViT combines the benefits of fast local representation learning in CNNs and global modeling properties in ViT. Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs. We benefit from efficient window-based self-attention. Each window has access to dedicated carrier tokens that participate in local and global representation learning. At a high level, global self-attentions enable the efficient cross-window communication at lower costs. FasterViT achieves a SOTA Pareto-front in terms of accuracy and image throughput. We have extensively validated its effectiveness on various CV tasks including classification, object detection and segmentation. We also show that HAT can be used as a plug-and-play module for existing networks and enhance them. We further demonstrate significantly faster and more accurate performance than competitive counterparts for images with high resolution. Code is available at https://github.com/NVlabs/FasterViT.

4/3/2024