Morphing Tokens Draw Strong Masked Image Models

2401.00254

Published 5/3/2024 by Taekyung Kim, Byeongho Heo, Dongyoon Han

Abstract

Masked image modeling (MIM) is a promising option for training Vision Transformers among various self-supervised learning (SSL) methods. The essence of MIM lies in token-wise masked token predictions, with targets tokenized from images or generated by pre-trained models such as vision-language models. While tokenizers or pre-trained models are plausible MIM targets, they often offer spatially inconsistent targets even for neighboring tokens, complicating models to learn unified discriminative representations. Our pilot study confirms that addressing spatial inconsistencies has the potential to enhance representation quality. Motivated by the findings, we introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets. DTM is compatible with various SSL frameworks; we showcase an improved MIM by employing DTM, barely introducing extra training costs. Our experiments on ImageNet-1K and ADE20K demonstrate the superiority of our methods compared with state-of-the-art, complex MIM methods. Furthermore, the comparative evaluation of the iNaturalists and fine-grained visual classification datasets further validates the transferability of our method on various downstream tasks. Code is available at https://github.com/naver-ai/dtm

Overview

This paper proposes a new self-supervision signal called "Dynamic Token Morphing" (DTM) to improve masked image modeling (MIM), a technique used to pre-train Vision Transformers.
The key idea is to dynamically aggregate contextually related tokens so that the prediction targets become contextualized, which addresses the spatial inconsistency that targets from tokenizers or pre-trained models often show even between neighboring tokens.
The authors run experiments on ImageNet-1K and ADE20K, plus transfer evaluations on iNaturalist and fine-grained visual classification datasets, and report that their approach outperforms state-of-the-art, more complex MIM methods.

Plain English Explanation

The paper introduces a new technique called "Dynamic Token Morphing" to improve a machine learning method called "masked image modeling". Masked image modeling is used to pre-train vision transformers, which are a type of AI model that can process and understand images.

Masking means hiding certain parts of the image from the model, forcing it to predict a target for each hidden piece (token). Those targets usually come from an image tokenizer or from another pre-trained model, and they can disagree with each other even for pieces that sit right next to one another. The core idea behind Dynamic Token Morphing is to group tokens that are related in context and merge each group into a single, contextualized target, so the model learns from a more consistent signal.

The authors report that their Dynamic Token Morphing approach outperforms previous methods for masked image modeling on ImageNet-1K and ADE20K, and that it transfers well to iNaturalist and fine-grained classification datasets. This suggests it could be an effective way to pre-train vision models, with barely any extra training cost.

Technical Explanation

The paper focuses on improving masked image modeling, a technique used to pre-train vision transformers and other image-based AI models. In masked image modeling, the model is trained to predict the content of "masked" or hidden regions in an image.
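The token-wise objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `masked_token_loss`, the array shapes, and the choice of cosine distance as the loss are all assumptions made for the example.

```python
import numpy as np

def masked_token_loss(pred, target, mask):
    """Mean cosine distance between predictions and targets at masked positions.

    pred, target: (num_tokens, dim) arrays.
    mask: (num_tokens,) boolean array, True where the token was hidden.
    Cosine distance is an illustrative choice, not taken from the paper.
    """
    p = pred[mask]
    t = target[mask]
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=1)))

rng = np.random.default_rng(0)
target = rng.normal(size=(16, 8))   # stand-in for tokenizer / teacher outputs
pred = rng.normal(size=(16, 8))     # stand-in for model predictions
mask = np.zeros(16, dtype=bool)
mask[::2] = True                    # hide every other token
loss = masked_token_loss(pred, target, mask)
```

Only the masked positions contribute to the loss, so predictions at visible positions do not affect it.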

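The key innovation in this paper is the "Dynamic Token Morphing" (DTM) approach. According to the abstract, targets produced by tokenizers or pre-trained models (such as vision-language models) are often spatially inconsistent even for neighboring tokens, which makes it harder for the model to learn unified discriminative representations. Rather than supervising every token against its own target in isolation, DTM dynamically aggregates contextually related tokens to yield contextualized targets, and these serve as the self-supervision signal.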
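One way to picture aggregating contextually related tokens into contextualized targets is the sketch below. It rests on several assumptions that the abstract does not confirm: tokens are grouped by cosine similarity to randomly chosen center tokens, each group is averaged, and the same grouping is applied to both predictions and targets. The names `morphing_matrix` and `morphed_loss` are hypothetical, and the authors' code at https://github.com/naver-ai/dtm may group tokens quite differently.

```python
import numpy as np

def morphing_matrix(tokens, num_groups, rng):
    """Build a (num_groups, num_tokens) averaging matrix from token similarity.

    Assumption for illustration: each token joins the randomly chosen
    center token it is most similar to (cosine similarity).
    """
    n = tokens.shape[0]
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    centers = rng.choice(n, size=num_groups, replace=False)
    sim = unit @ unit[centers].T                    # (n, num_groups)
    sim[centers, np.arange(num_groups)] = np.inf    # each center stays in its own group
    assign = np.argmax(sim, axis=1)                 # group index per token
    m = np.zeros((num_groups, n))
    m[assign, np.arange(n)] = 1.0
    return m / m.sum(axis=1, keepdims=True)         # rows average their members

def morphed_loss(pred, target, m):
    """Cosine distance between aggregated predictions and aggregated targets."""
    p = m @ pred
    t = m @ target
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=1)))

rng = np.random.default_rng(0)
target = rng.normal(size=(16, 8))   # stand-in for per-token targets
pred = rng.normal(size=(16, 8))     # stand-in for model predictions
m = morphing_matrix(target, num_groups=4, rng=rng)
loss = morphed_loss(pred, target, m)
```

Resampling the centers and the number of groups at each training step would make the grouping change over time, which is one plausible reading of "dynamic" here. Because the aggregation is a single matrix product, a scheme like this adds little computation on top of the base objective.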

The authors motivate the method with a pilot study, which confirms that addressing spatial inconsistencies in the targets has the potential to enhance representation quality. They then run experiments on ImageNet-1K and ADE20K and report that their method outperforms state-of-the-art, more complex MIM methods.
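The abstract's notion of spatially inconsistent targets between neighboring tokens can be made concrete with a simple score. The sketch below is an assumed diagnostic, not the measurement used in the paper: it averages the cosine similarity between horizontally and vertically adjacent target tokens on the patch grid, so lower values indicate less consistent neighbors. The function name `neighbor_consistency` is hypothetical.

```python
import numpy as np

def neighbor_consistency(grid):
    """Mean cosine similarity between adjacent tokens on a (H, W, dim) grid."""
    unit = grid / np.linalg.norm(grid, axis=-1, keepdims=True)
    horiz = np.sum(unit[:, :-1] * unit[:, 1:], axis=-1)   # left-right pairs
    vert = np.sum(unit[:-1, :] * unit[1:, :], axis=-1)    # up-down pairs
    return float(np.concatenate([horiz.ravel(), vert.ravel()]).mean())

rng = np.random.default_rng(0)
noisy = rng.normal(size=(4, 4, 8))   # unrelated neighbors
smooth = np.ones((4, 4, 8))          # identical neighbors
noisy_score = neighbor_consistency(noisy)
smooth_score = neighbor_consistency(smooth)
```

A grid of identical tokens scores 1, while unrelated random tokens score much lower.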

The authors also state that DTM is compatible with various SSL frameworks and adds barely any extra training cost. A comparative evaluation on the iNaturalist and fine-grained visual classification datasets further supports the transferability of the method to downstream tasks. Overall, the work demonstrates the potential of Dynamic Token Morphing to improve the effectiveness of masked image modeling and, by extension, the pre-training of vision models.

Critical Analysis

The paper presents a well-designed study that makes a valuable contribution to the field of masked image modeling. The Dynamic Token Morphing approach seems to be a promising technique for improving the pre-training of vision transformers and other image-based AI models.

One potential limitation of the work is that the results highlighted in the abstract center on established benchmarks: ImageNet-1K, ADE20K, iNaturalist, and fine-grained visual classification datasets. It would be interesting to see how the Dynamic Token Morphing approach performs on more diverse and challenging real-world image understanding tasks.

Additionally, beyond the pilot study on spatial inconsistency, it remains an open question which underlying mechanisms make Dynamic Token Morphing effective. Further research could explore why aggregating contextually related tokens into contextualized targets leads to better visual representations.

Overall, this paper offers a solid technical contribution and lays the groundwork for future research on more robust and versatile pre-training strategies for vision AI models.

Conclusion

The "Morphing Tokens Draw Strong Masked Image Models" paper introduces a novel technique to improve the pre-training of Vision Transformers. By dynamically aggregating contextually related tokens into contextualized targets, it gives the model a more spatially consistent training signal.

The authors demonstrate the effectiveness of their Dynamic Token Morphing approach through experiments on ImageNet-1K and ADE20K, reporting that it outperforms previous masked image modeling methods, and through transfer evaluations on iNaturalist and fine-grained visual classification datasets.

Because DTM is compatible with various SSL frameworks and adds barely any extra training cost, it could be a practical addition to future pre-training pipelines for vision models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Emerging Property of Masked Token for Effective Pre-training

Hyesong Choi, Hunsang Lee, Seyoung Joung, Hyejin Park, Jiyeong Kim, Dongbo Min

Driven by the success of Masked Language Modeling (MLM), the realm of self-supervised learning for computer vision has been invigorated by the central role of Masked Image Modeling (MIM) in driving recent breakthroughs. Notwithstanding the achievements of MIM across various downstream tasks, its overall efficiency is occasionally hampered by the lengthy duration of the pre-training phase. This paper presents a perspective that the optimization of masked tokens as a means of addressing the prevailing issue. Initially, we delve into an exploration of the inherent properties that a masked token ought to possess. Within the properties, we principally dedicated to articulating and emphasizing the `data singularity' attribute inherent in masked tokens. Through a comprehensive analysis of the heterogeneity between masked tokens and visible tokens within pre-trained models, we propose a novel approach termed masked token optimization (MTO), specifically designed to improve model efficiency through weight recalibration and the enhancement of the key property of masked tokens. The proposed method serves as an adaptable solution that seamlessly integrates into any MIM approach that leverages masked tokens. As a result, MTO achieves a considerable improvement in pre-training efficiency, resulting in an approximately 50% reduction in pre-training epochs required to attain converged performance of the recent approaches.

4/15/2024

cs.CV

SeiT++: Masked Token Modeling Improves Storage-efficient Training

Minhyun Lee, Song Park, Byeongho Heo, Dongyoon Han, Hyunjung Shim

Recent advancements in Deep Neural Network (DNN) models have significantly improved performance across computer vision tasks. However, achieving highly generalizable and high-performing vision models requires expansive datasets, resulting in significant storage requirements. This storage challenge is a critical bottleneck for scaling up models. A recent breakthrough by SeiT proposed the use of Vector-Quantized (VQ) feature vectors (i.e., tokens) as network inputs for vision classification. This approach achieved 90% of the performance of a model trained on full-pixel images with only 1% of the storage. While SeiT needs labeled data, its potential in scenarios beyond fully supervised learning remains largely untapped. In this paper, we extend SeiT by integrating Masked Token Modeling (MTM) for self-supervised pre-training. Recognizing that self-supervised approaches often demand more data due to the lack of labels, we introduce TokenAdapt and ColorAdapt. These methods facilitate comprehensive token-friendly data augmentation, effectively addressing the increased data requirements of self-supervised learning. We evaluate our approach across various scenarios, including storage-efficient ImageNet-1k classification, fine-grained classification, ADE-20k semantic segmentation, and robustness benchmarks. Experimental results demonstrate consistent performance improvement in diverse experiments, validating the effectiveness of our method. Code is available at https://github.com/naver-ai/tokenadapt.

4/4/2024

cs.CV

SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation

Yike Yuan, Huanzhang Dou, Fengjun Guo, Xi Li

This paper represents a neat yet effective framework, named SemanticMIM, to integrate the advantages of masked image modeling (MIM) and contrastive learning (CL) for general visual representation. We conduct a thorough comparative analysis between CL and MIM, revealing that their complementary advantages fundamentally stem from two distinct phases, i.e., compression and reconstruction. Specifically, SemanticMIM leverages a proxy architecture that customizes interaction between image and mask tokens, bridging these two phases to achieve general visual representation with the property of abundant semantic and positional awareness. Through extensive qualitative and quantitative evaluations, we demonstrate that SemanticMIM effectively amalgamates the benefits of CL and MIM, leading to significant enhancement of performance and feature linear separability. SemanticMIM also offers notable interpretability through attention response visualization. Codes are available at https://github.com/yyk-wew/SemanticMIM.

6/18/2024

cs.CV

Auto-Encoding Morph-Tokens for Multimodal LLM

Kaihang Pan, Siliang Tang, Juncheng Li, Zhaoyu Fan, Wei Chow, Shuicheng Yan, Tat-Seng Chua, Yueting Zhuang, Hanwang Zhang

For multimodal LLMs, the synergy of visual comprehension (textual output) and generation (visual output) presents an ongoing challenge. This is due to a conflicting objective: for comprehension, an MLLM needs to abstract the visuals; for generation, it needs to preserve the visuals as much as possible. Thus, the objective is a dilemma for visual-tokens. To resolve the conflict, we propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing MLLM to generate texts; for generation, they take on a different, non-conflicting role as complete visual-tokens for image reconstruction, where the missing visual cues are recovered by the MLLM. Extensive experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously. Our project is available at https://github.com/DCDmllm/MorphTokens.

5/6/2024

cs.CV