ViT-1.58b: Mobile Vision Transformers in the 1-bit Era

Read original: arXiv:2406.18051 - Published 6/27/2024 by Zhengqing Yuan, Rong Zhou, Hongyi Wang, Lifang He, Yanfang Ye, Lichao Sun

Overview

This paper presents ViT-1.58b, a 1.58-bit quantized vision transformer designed for deployment in resource-constrained environments.
The researchers apply ternary quantization, constraining weights to {-1, 0, 1} and quantizing activations to 8-bit precision, to reduce memory and computational overhead.
On CIFAR-10 and ImageNet-1k, the model maintains accuracy comparable to full-precision ViT while significantly reducing memory usage and computational cost.

Plain English Explanation

The researchers have developed a new type of vision transformer model called ViT-1.58b that is designed to work well on mobile devices and other hardware with limited resources. Vision transformers are a powerful type of machine learning model that can be used for tasks like image classification and object detection.

Typically, these models require a lot of computing power and memory to run, which makes them difficult to use on phones, tablets, and other mobile devices. The key innovation in ViT-1.58b is an aggressive form of compression, called quantization, that shrinks the model so it needs far less memory and computation.

The researchers achieved this by restricting each weight to one of three values (-1, 0, or 1) and storing activations as 8-bit numbers, without sacrificing too much accuracy. They also made some architectural changes to the transformer model to further improve its efficiency.
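To make the precision numbers concrete: the paper's abstract describes weights restricted to the three values {-1, 0, 1}. A three-valued symbol carries log2(3) ≈ 1.58 bits, which is presumably where the model name comes from. The sketch below works through that arithmetic and a rough weight-memory estimate. The 86-million parameter count is an illustrative assumption (roughly the size of a ViT-Base model), not a figure taken from the paper, and the estimate ignores packing overhead and any layers kept at higher precision.

```python
import math


def bits_per_symbol(num_levels: int) -> float:
    """Information content, in bits, of one value drawn from num_levels levels."""
    return math.log2(num_levels)


def weight_memory_mb(num_params: int, bits_per_weight: float) -> float:
    """Idealized weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Ternary weights take one of three values: -1, 0, or 1.
TERNARY_BITS = bits_per_symbol(3)

# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 86_000_000

fp32_mb = weight_memory_mb(NUM_PARAMS, 32)
ternary_mb = weight_memory_mb(NUM_PARAMS, TERNARY_BITS)

print(f"bits per ternary weight: {TERNARY_BITS:.2f}")
print(f"fp32 weights:    {fp32_mb:.1f} MB")
print(f"ternary weights: {ternary_mb:.1f} MB")
print(f"ideal compression: {fp32_mb / ternary_mb:.1f}x")
```

The ratio is an upper bound: practical storage formats usually spend 2 bits per ternary weight, which lowers the achievable compression.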

As a result, ViT-1.58b keeps accuracy comparable to a full-precision vision transformer while using significantly less memory and computation. This could make it a useful tool for deploying powerful computer vision capabilities on resource-constrained devices like smartphones and IoT sensors.

Technical Explanation

The key technical innovations in ViT-1.58b are a ternary (1.58-bit) quantization scheme and architectural modifications to the transformer model.

The quantization approach constrains weights to the three values {-1, 0, 1} and quantizes activations to 8-bit precision. Because a three-valued weight carries log2(3) ≈ 1.58 bits of information, the model stores far less per weight than a full-precision ViT without significant accuracy degradation.
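As a minimal sketch of what mapping weights to {-1, 0, 1} can look like, the example below uses absmean rounding: divide every weight by the mean absolute value of the matrix, round to the nearest integer, and clip to [-1, 1]. This rule is known from BitNet b1.58-style language models; whether ViT-1.58b uses exactly this rule is an assumption here, and the function names and example matrix are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def ternarize(weights: Matrix, eps: float = 1e-8) -> Tuple[List[List[int]], float]:
    """Map a weight matrix to {-1, 0, 1} plus one shared scale (absmean rule)."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps  # mean absolute value; eps avoids divide-by-zero
    quantized = [
        [max(-1, min(1, round(w / scale))) for w in row]  # round, then clip to [-1, 1]
        for row in weights
    ]
    return quantized, scale


def dequantize(quantized: List[List[int]], scale: float) -> Matrix:
    """Approximate reconstruction of the original weights."""
    return [[q * scale for q in row] for row in quantized]


# Hypothetical 2x3 weight matrix, for illustration only.
W = [[0.9, -0.05, 0.4], [-1.2, 0.02, -0.3]]
W_q, s = ternarize(W)
W_hat = dequantize(W_q, s)

print(W_q)  # small weights collapse to 0, the rest keep only their sign
print(round(s, 4))
```

Only the ternary matrix and a single floating-point scale need to be stored; the reconstruction `W_hat` is a coarse approximation, which is why such models are typically trained with the quantization in the loop rather than quantized after the fact.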

The researchers also introduced several architectural changes to the transformer model to improve its efficiency:

Trio-ViT-inspired re-parameterization of the softmax operation
Efficient attention module design inspired by ViDiT-Q
Hybrid convolution-transformer blocks for better feature extraction

Through these innovations, ViT-1.58b maintains accuracy comparable to full-precision ViT on CIFAR-10 and ImageNet-1k while significantly reducing memory usage and computational cost.

Critical Analysis

The paper evaluates the ViT-1.58b model on the CIFAR-10 and ImageNet-1k image classification benchmarks, providing evidence for its effectiveness. However, the authors acknowledge several limitations and areas for future work:

The quantization approach may not generalize as well to other transformer architectures beyond the proposed modifications.
The model was only evaluated on image classification tasks, and its performance on more complex computer vision problems like object detection or segmentation is still unknown.
The energy efficiency of the model was measured in simulation rather than real-world deployment, so the actual power consumption on mobile devices may differ.

Additionally, the paper does not discuss potential privacy or security implications of deploying such powerful computer vision models on resource-constrained edge devices. This is an important consideration as these models could be used for surveillance or other ethically sensitive applications.

Overall, the ViT-1.58b model represents a significant advance in making vision transformers practical for mobile and embedded use cases. However, further research is needed to fully understand its capabilities, limitations, and broader societal impact.

Conclusion

The ViT-1.58b model proposed in this paper suggests that vision transformers can be made practical for mobile and edge devices. By combining ternary weights with 8-bit activations, the researchers have created a model that stays close to full-precision ViT accuracy while significantly reducing memory usage and computational cost.

This work has important implications for bringing advanced computer vision capabilities to a wide range of applications, from smart home devices to autonomous vehicles. As the demand for on-device AI continues to grow, models like ViT-1.58b could play a crucial role in enabling innovative and privacy-preserving applications on resource-constrained platforms.

However, it will be important for future research to carefully consider the ethical implications of deploying such powerful vision models in the real world. Ongoing work is needed to ensure these technologies are developed and deployed responsibly to benefit society as a whole.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ViT-1.58b: Mobile Vision Transformers in the 1-bit Era

Zhengqing Yuan, Rong Zhou, Hongyi Wang, Lifang He, Yanfang Ye, Lichao Sun

Vision Transformers (ViTs) have achieved remarkable performance in various image classification tasks by leveraging the attention mechanism to process image patches as tokens. However, the high computational and memory demands of ViTs pose significant challenges for deployment in resource-constrained environments. This paper introduces ViT-1.58b, a novel 1.58-bit quantized ViT model designed to drastically reduce memory and computational overhead while preserving competitive performance. ViT-1.58b employs ternary quantization, which refines the balance between efficiency and accuracy by constraining weights to {-1, 0, 1} and quantizing activations to 8-bit precision. Our approach ensures efficient scaling in terms of both memory and computation. Experiments on CIFAR-10 and ImageNet-1k demonstrate that ViT-1.58b maintains comparable accuracy to full-precision ViT, with significant reductions in memory usage and computational costs. This paper highlights the potential of extreme quantization techniques in developing sustainable AI solutions and contributes to the broader discourse on efficient model deployment in practical applications. Our code and weights are available at https://github.com/DLYuanGod/ViT-1.58b.

6/27/2024
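The ViT-1.58b abstract above pairs ternary weights with 8-bit activations. One reason this pairing is attractive is that a matrix-vector product with ternary weights needs no multiplications inside the accumulation loop: each weight either adds an activation, subtracts it, or skips it. The sketch below illustrates that idea under stated assumptions. It uses symmetric absmax scaling for the activations with one scale per vector, and the ternary weights and weight scale are hypothetical values; none of these details are confirmed by the abstract.

```python
from typing import List, Tuple


def quantize_activations(
    x: List[float], bits: int = 8, eps: float = 1e-8
) -> Tuple[List[int], float]:
    """Symmetric absmax quantization of one activation vector to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    scale = max(abs(v) for v in x) / qmax + eps
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return q, scale


def ternary_linear(
    x_q: List[int], x_scale: float, w_q: List[List[int]], w_scale: float
) -> List[float]:
    """Compute y = W x where W is ternary: each output is a signed integer sum."""
    out = []
    for row in w_q:  # one row of ternary weights per output unit
        acc = 0
        for w, a in zip(row, x_q):
            if w == 1:
                acc += a  # +1 weight: add the activation
            elif w == -1:
                acc -= a  # -1 weight: subtract the activation
            # 0 weight: contributes nothing, so it is skipped
        out.append(acc * x_scale * w_scale)  # rescale back to real values
    return out


x = [0.6, -1.0, 0.25]            # example activations
w_q = [[1, 0, -1], [-1, 1, 0]]   # hypothetical ternary weights
w_scale = 0.5                    # hypothetical weight scale

x_q, x_scale = quantize_activations(x)
y = ternary_linear(x_q, x_scale, w_q, w_scale)

print(x_q)
print([round(v, 3) for v in y])
```

The integer accumulator is rescaled once per output, so the floating-point work grows with the number of outputs rather than the number of weights.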

Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey

Dayou Du, Gu Gong, Xiaowen Chu

Vision Transformers (ViTs) have recently garnered considerable attention, emerging as a promising alternative to convolutional neural networks (CNNs) in several vision-related applications. However, their large model sizes and high computational and memory demands hinder deployment, especially on resource-constrained devices. This underscores the necessity of algorithm-hardware co-design specific to ViTs, aiming to optimize their performance by tailoring both the algorithmic structure and the underlying hardware accelerator to each other's strengths. Model quantization, by converting high-precision numbers to lower-precision, reduces the computational demands and memory needs of ViTs, allowing the creation of hardware specifically optimized for these quantized algorithms, boosting efficiency. This article provides a comprehensive survey of ViTs quantization and its hardware acceleration. We first delve into the unique architectural attributes of ViTs and their runtime characteristics. Subsequently, we examine the fundamental principles of model quantization, followed by a comparative analysis of the state-of-the-art quantization techniques for ViTs. Additionally, we explore the hardware acceleration of quantized ViTs, highlighting the importance of hardware-friendly algorithm design. In conclusion, this article will discuss ongoing challenges and future research paths. We consistently maintain the related open-source materials at https://github.com/DD-DuDa/awesome-vit-quantization-acceleration.

5/2/2024

Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers

Zhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, Hao Tang, Zhong Jia Xue, Peiyan Dong, Caiwen Ding, Yanzhi Wang, Xue Lin, Zhenman Fang

Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs). However, ViT models are often computation-intensive for efficient deployment on resource-limited edge devices. This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs, to design efficient ViT models for hardware implementation while preserving the accuracy. First, Quasar-ViT trains a supernet using our row-wise flexible mixed-precision quantization scheme, mixed-precision weight entanglement, and supernet layer scaling techniques. Then, it applies an efficient hardware-oriented search algorithm, integrated with hardware latency and resource modeling, to determine a series of optimal subnets from supernet under different inference latency targets. Finally, we propose a series of model-adaptive designs on the FPGA platform to support the architecture search and mitigate the gap between the theoretical computation reduction and the practical inference speedup. Our searched models achieve 101.5, 159.6, and 251.6 frames-per-second (FPS) inference speed on the AMD/Xilinx ZCU102 FPGA with 80.4%, 78.6%, and 74.9% top-1 accuracy, respectively, for the ImageNet dataset, consistently outperforming prior works.

7/26/2024

Q-HyViT: Post-Training Quantization of Hybrid Vision Transformers with Bridge Block Reconstruction for IoT Systems

Jemin Lee, Yongin Kwon, Sihyeong Park, Misun Yu, Jeman Park, Hwanjun Song

Recently, vision transformers (ViTs) have superseded convolutional neural networks in numerous applications, including classification, detection, and segmentation. However, the high computational requirements of ViTs hinder their widespread implementation. To address this issue, researchers have proposed efficient hybrid transformer architectures that combine convolutional and transformer layers with optimized attention computation of linear complexity. Additionally, post-training quantization has been proposed as a means of mitigating computational demands. For mobile devices, achieving optimal acceleration for ViTs necessitates the strategic integration of quantization techniques and efficient hybrid transformer structures. However, no prior investigation has applied quantization to efficient hybrid transformers. In this paper, we discover that applying existing post-training quantization (PTQ) methods for ViTs to efficient hybrid transformers leads to a drastic accuracy drop, attributed to the four following challenges: (i) highly dynamic ranges, (ii) zero-point overflow, (iii) diverse normalization, and (iv) limited model parameters (<5M). To overcome these challenges, we propose a new post-training quantization method, which is the first to quantize efficient hybrid ViTs (MobileViTv1, MobileViTv2, Mobile-Former, EfficientFormerV1, EfficientFormerV2). We achieve a significant improvement of 17.73% for 8-bit and 29.75% for 6-bit on average, respectively, compared with existing PTQ methods (EasyQuant, FQ-ViT, PTQ4ViT, and RepQ-ViT). We plan to release our code at https://gitlab.com/ones-ai/q-hyvit.

5/20/2024