Bootstrapping SparseFormers from Vision Foundation Models

Read original: arXiv:2312.01987 - Published 4/5/2024 by Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

Overview

The recently proposed SparseFormer architecture offers an alternative approach to visual understanding that uses significantly fewer visual tokens, reducing computational costs while still achieving promising performance.
However, training SparseFormers from scratch is expensive, and scaling up the number of parameters can be challenging.
This paper proposes a method to bootstrap SparseFormers from pre-trained ViT-based vision foundation models in a simple and efficient way.

Plain English Explanation

SparseFormer is a new architecture for computer vision tasks that uses fewer visual "tokens" or building blocks compared to traditional approaches. This reduces the computational resources required, making it more efficient. However, training SparseFormer models from scratch is still costly and difficult to scale up.

This research paper introduces a way to "bootstrap" SparseFormer models by starting with large, pre-trained vision transformer models. Since most of the SparseFormer architecture is similar to standard transformer models, the authors can reuse the pre-trained weights and only need to fine-tune a few key components. This allows them to create SparseFormer models quickly and with less training data.

The result is a SparseFormer model that can achieve high performance on image recognition tasks, like the ImageNet challenge, using only a small number of visual tokens. The authors also show how this approach can be extended to create multimodal models that can handle both images and text, without requiring captions or labels during the bootstrapping process.

This work demonstrates a clever way to leverage existing pre-trained vision models to build more efficient computer vision systems, potentially making them more accessible and usable in real-world applications.

Technical Explanation

The core idea of this paper is to "bootstrap" SparseFormer architectures from large-scale pre-trained vision transformer models. Since most of the SparseFormer blocks are standard transformer layers, the authors can inherit weights from these pre-trained models and freeze them as much as possible.
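The freeze-and-inherit idea can be pictured as a per-block training plan. The sketch below is illustrative only: the block names, the `Block` class, and the choice of four fine-tuned early blocks are assumptions made for the example, not details taken from the paper (ViT-L does have 24 transformer blocks).

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str                # hypothetical block name, for illustration only
    pretrained: bool         # True if the weights are inherited from the ViT
    trainable: bool = False  # frozen unless the plan below says otherwise


def build_bootstrap_plan(num_vit_blocks: int, num_early_finetuned: int) -> list[Block]:
    """Mark which blocks receive gradient updates during bootstrapping.

    Assumed layout: one newly initialized focusing transformer followed by
    the inherited ViT blocks. Only the focusing transformer and the first
    `num_early_finetuned` inherited blocks are trainable; the rest stay frozen.
    """
    if not 0 <= num_early_finetuned <= num_vit_blocks:
        raise ValueError("num_early_finetuned must be in [0, num_vit_blocks]")
    plan = [Block("focusing_transformer", pretrained=False, trainable=True)]
    for i in range(num_vit_blocks):
        plan.append(
            Block(f"vit_block_{i}", pretrained=True, trainable=i < num_early_finetuned)
        )
    return plan


# 24 blocks as in ViT-L; fine-tuning 4 early blocks is an assumed value.
plan = build_bootstrap_plan(num_vit_blocks=24, num_early_finetuned=4)
trainable = [b.name for b in plan if b.trainable]
frozen = [b.name for b in plan if not b.trainable]
print(f"trainable: {trainable}")
print(f"frozen blocks: {len(frozen)} of {len(plan)}")
```

In a real framework the same plan would translate into disabling gradients for every frozen block, so the optimizer only ever sees a small fraction of the parameters.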

The key SparseFormer-specific component is a lightweight "focusing transformer" that adjusts the regions of interest (RoIs) to reduce the number of visual tokens. The authors only need to train this focusing transformer and fine-tune a few early pre-trained blocks to align the final token representation.
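To make "adjusting RoIs" concrete, here is a minimal sketch of how a token's region could be updated from predicted deltas. It assumes the box-delta parameterization common in object detection (shift the centre by a fraction of the box size, scale the size exponentially) and an evenly tiled starting grid. Both are assumptions for illustration; this summary does not specify the exact update rule or initialization.

```python
import math
from typing import NamedTuple


class RoI(NamedTuple):
    cx: float  # centre x, normalized to [0, 1]
    cy: float  # centre y, normalized to [0, 1]
    w: float   # width as a fraction of the image
    h: float   # height as a fraction of the image


def adjust_roi(roi: RoI, dx: float, dy: float, dw: float, dh: float) -> RoI:
    """Apply one adjustment step: move the centre relative to the current
    size and rescale width and height by exp(delta)."""
    return RoI(
        cx=roi.cx + dx * roi.w,
        cy=roi.cy + dy * roi.h,
        w=roi.w * math.exp(dw),
        h=roi.h * math.exp(dh),
    )


def initial_grid(n: int) -> list[RoI]:
    """Assumed starting point: n x n RoIs tiling the image evenly."""
    size = 1.0 / n
    return [
        RoI((col + 0.5) * size, (row + 0.5) * size, size, size)
        for row in range(n)
        for col in range(n)
    ]


rois = initial_grid(7)  # 7 x 7 = 49 token RoIs
# Hypothetical deltas: shift right by half a box width and double the width.
adjusted = adjust_roi(rois[0], dx=0.5, dy=0.0, dw=math.log(2.0), dh=0.0)
print(len(rois), adjusted)
```

Each token then samples image features only inside its own RoI, which is why the token count can stay small while the regions move toward informative parts of the image.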

By bootstrapping in this way, the researchers can create SparseFormer models within a few hours, using a relatively small set of training images (e.g., ImageNet-1K) and no labels or captions. For example, a unimodal SparseFormer bootstrapped from an AugReg-ViT-L/16-384 model reaches 84.9% accuracy on ImageNet-1K using only 49 visual tokens.
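Two things in this paragraph are easy to check with a few lines of code: how many tokens the original model uses, and why no labels are needed. A ViT with 16-pixel patches at 384x384 resolution produces (384 / 16)^2 = 576 patch tokens, so 49 tokens is roughly a 12x reduction. The label-free part works because the training target is the frozen teacher's own output. The sketch below uses a cosine-distance loss as a stand-in; the actual loss used by the authors is not stated in this summary, so treat that choice as an assumption.

```python
import math


def patch_token_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


def cosine_alignment_loss(student: list[float], teacher: list[float]) -> float:
    """1 - cosine similarity between student and teacher embeddings.

    The teacher embedding is the only target, so no label or caption is used.
    Cosine distance is an assumed choice of alignment loss.
    """
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    if norm_s == 0.0 or norm_t == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_s * norm_t)


vit_tokens = patch_token_count(image_size=384, patch_size=16)
reduction = vit_tokens / 49
print(f"ViT-L/16-384 patch tokens: {vit_tokens}, reduction: {reduction:.1f}x")

# Toy embeddings (made up for the example).
teacher = [0.6, 0.8, 0.0]
aligned_student = [1.2, 1.6, 0.0]     # same direction as the teacher
misaligned_student = [0.0, 0.0, 1.0]  # orthogonal to the teacher
loss_aligned = cosine_alignment_loss(aligned_student, teacher)
loss_misaligned = cosine_alignment_loss(misaligned_student, teacher)
print(loss_aligned, loss_misaligned)
```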

The authors also demonstrate the flexibility of this approach by creating multimodal SparseFormer models bootstrapped from CLIP, a vision-language model pre-trained on image-text pairs. These CLIP-bootstrapped SparseFormers achieve notable zero-shot performance at a much lower computational cost than the original CLIP vision encoder, even though they never see a caption during bootstrapping.
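Zero-shot classification with a CLIP-style model works by comparing an image embedding against text embeddings of candidate class prompts and picking the closest one. Because the bootstrapped encoder is trained to reproduce the original CLIP image embeddings, it can be dropped into the same procedure. The sketch below shows only that comparison step; the embedding values and prompts are made up for the example, and real embeddings would come from the image and text encoders.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def zero_shot_classify(image_embedding: list[float],
                       text_embeddings: dict[str, list[float]]) -> str:
    """Return the prompt whose embedding has the highest cosine similarity
    with the image embedding."""
    img = l2_normalize(image_embedding)
    scores = {
        prompt: sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for prompt, txt in text_embeddings.items()
    }
    return max(scores, key=scores.get)


# Toy 3-dimensional embeddings (hypothetical values).
text_embeddings = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a cat": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}
image_embedding = [0.9, 0.1, 0.2]
prediction = zero_shot_classify(image_embedding, text_embeddings)
print(prediction)
```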

Critical Analysis

The paper presents a clever and efficient way to create SparseFormer models by leveraging existing pre-trained vision transformers. This approach addresses the key challenges of training SparseFormers from scratch, which can be expensive and difficult to scale.

However, the authors do not provide a detailed analysis of the limitations or potential drawbacks of this bootstrapping method. For example, it's unclear how the performance of the bootstrapped SparseFormers compares to models trained from scratch on the same datasets and tasks.

Additionally, the paper focuses on image recognition tasks, but it would be valuable to understand how well this approach generalizes to other computer vision problems, such as semantic segmentation or open-vocabulary object detection.

Overall, the research presents a promising direction for building efficient and capable computer vision systems by leveraging pre-trained models. Further exploration of the limitations and broader applicability of this bootstrapping approach could strengthen the impact of this work.

Conclusion

This paper introduces a method for efficiently bootstrapping SparseFormer architectures from large-scale pre-trained vision transformer models. By reusing the weights of the standard transformer blocks and fine-tuning only a few key components, the authors can create SparseFormer models that achieve high performance on image recognition tasks using significantly fewer visual tokens.

This work demonstrates the potential for leveraging existing pre-trained models to build more efficient and accessible computer vision systems. The ability to quickly create SparseFormer models with limited training data and without captions or labels could enable the deployment of these models in a wide range of real-world applications, from robotics to medical imaging.

As the field of computer vision continues to evolve, techniques like this bootstrapping approach may play an important role in making advanced vision models more practical and scalable, ultimately driving progress in areas that rely on efficient and effective visual understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Bootstrapping SparseFormers from Vision Foundation Models

Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are the standard transformer ones, we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore, we only need to train the SparseFormer-specific lightweight focusing transformer to adjust token RoIs and fine-tune a few early pre-trained blocks to align the final token representation. In such a way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) using a rather smaller amount of training samples (e.g., IN-1K) and without labels or captions within just a few hours. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer from CLIPs also demonstrates notable zero-shot performance with highly reduced computational cost without seeing any caption during the bootstrapping procedure. In addition, CLIP-bootstrapped SparseFormers, which align the output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models. Code and models are available at https://github.com/showlab/sparseformer

4/5/2024

Vision Transformer with Sparse Scan Prior

Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He

In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye's efficient information processing. Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism ($\mathrm{S}^3\mathrm{A}$). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye's functionality and significantly reduces the computational load of vision models. Building on $\mathrm{S}^3\mathrm{A}$, we introduce the Sparse Scan Vision Transformer (SSViT). Extensive experiments demonstrate the outstanding performance of SSViT across a variety of tasks. Specifically, on ImageNet classification, without additional supervision or training data, SSViT achieves top-1 accuracies of 84.4%/85.7% with 4.4G/18.2G FLOPs. SSViT also excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Its robustness is further validated across diverse datasets. Code will be available at https://github.com/qhfan/SSViT.

5/24/2024

VistaFormer: Scalable Vision Transformers for Satellite Image Time Series Segmentation

Ezra MacDonald, Derek Jacoby, Yvonne Coady

We introduce VistaFormer, a lightweight Transformer-based model architecture for the semantic segmentation of remote-sensing images. This model uses a multi-scale Transformer-based encoder with a lightweight decoder that aggregates global and local attention captured in the encoder blocks. VistaFormer uses position-free self-attention layers which simplifies the model architecture and removes the need to interpolate temporal and spatial codes, which can reduce model performance when training and testing image resolutions differ. We investigate simple techniques for filtering noisy input signals like clouds and demonstrate that improved model scalability can be achieved by substituting Multi-Head Self-Attention (MHSA) with Neighbourhood Attention (NA). Experiments on the PASTIS and MTLCC crop-type segmentation benchmarks show that VistaFormer achieves better performance than comparable models and requires only 8% of the floating point operations using MHSA and 11% using NA while also using fewer trainable parameters. VistaFormer with MHSA improves on state-of-the-art mIoU scores by 0.1% on the PASTIS benchmark and 3% on the MTLCC benchmark while VistaFormer with NA improves on the MTLCC benchmark by 3.7%.

9/16/2024

A Spitting Image: Modular Superpixel Tokenization in Vision Transformers

Marius Aasan, Odd Kolbjørnsen, Anne Schistad Solberg, Adín Ramirez Rivera

Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization independent of the semantic content of an image. We propose a modular superpixel tokenization strategy which decouples tokenization and feature extraction; a shift from contemporary approaches where these are treated as an undifferentiated whole. Using on-line content-aware tokenization and scale- and shape-invariant positional embeddings, we perform experiments and ablations that contrast our approach with patch-based tokenization and randomized partitions as baselines. We show that our method significantly improves the faithfulness of attributions, gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining predictive performance in classification tasks. Our approach provides a modular tokenization framework commensurable with standard architectures, extending the space of ViTs to a larger class of semantically-rich models.

8/16/2024