Learning with Unmasked Tokens Drives Stronger Vision Learners

Read original: arXiv:2310.13593 - Published 8/27/2024 by Taekyung Kim, Sanghyuk Chun, Byeongho Heo, Dongyoon Han


Overview

Masked image modeling (MIM) is a popular self-supervised learning strategy for learning strong visual representations.
MIM methods such as Masked Autoencoder (MAE) randomly mask input tokens, train an encoder on the remaining visible tokens, and use a decoder to reconstruct the masked ones.
However, MIM-pretrained encoders often exhibit a limited attention span, which the authors attribute to MIM's sole focus on regressing the masked tokens during pretraining.
This may impede the encoder's ability to learn broader contextual information.
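
The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `random_masking`, the toy patch tensor, and the 0.75 mask ratio are assumptions chosen for the example.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Randomly hide a fraction of tokens.

    tokens: array of shape (num_tokens, dim).
    Returns the visible tokens, a boolean mask (True = hidden from the
    encoder), and the sorted indices of the visible tokens.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_tokens = tokens.shape[0]
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    # Shuffle token indices and keep the first num_keep of them.
    keep_idx = np.sort(rng.permutation(num_tokens)[:num_keep])
    mask = np.ones(num_tokens, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask, keep_idx

# Toy input: 16 patch tokens with 4 features each (assumed sizes).
patches = np.arange(16 * 4, dtype=np.float32).reshape(16, 4)
visible, mask, keep_idx = random_masking(patches, rng=np.random.default_rng(0))
print(visible.shape, int(mask.sum()))
```

In an MAE-style setup the encoder would receive only `visible`, while the decoder would be asked to reconstruct the tokens where `mask` is true.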

Plain English Explanation

Masked image modeling (MIM) is a technique used to train AI models to understand images without being explicitly told what the images contain. The idea is to randomly hide or "mask" parts of the image, then have the model try to figure out what the hidden parts are based on the rest of the image.

This approach, used in models like Masked Autoencoder (MAE), has been shown to help the model learn strong visual representations - it can understand the overall meaning and context of an image, not just individual objects or features.

However, the researchers found that models trained this way sometimes have a limited "attention span" - they focus too narrowly on just the masked parts, without really understanding the broader context of the whole image. This can potentially hold back the model's performance on certain tasks.

To address this, the researchers developed a new approach that explicitly trains the model to also pay attention to the unmasked parts of the image, not just the masked parts. The idea is to give the model a richer understanding of the overall context, which can then help it better reconstruct the masked areas.

The researchers show that this simple change leads to more discriminative representations, with a ViT-B model reaching 84.2% top-1 accuracy on ImageNet-1K, a gain of 0.6 percentage points. They also find that the enhanced pre-training method boosts performance on downstream tasks like semantic segmentation and fine-grained classification, as well as on robustness evaluations.

Technical Explanation

The core idea of the researchers' approach is to explicitly incorporate the unmasked tokens into the training process, in addition to the standard MIM objective of reconstructing the masked tokens.

Specifically, the encoder receives an additional broader-context supervision signal on its unmasked tokens while the decoder still reconstructs the masked tokens. The encoded unmasked tokens therefore carry richer contextual information, which the masked tokens can then leverage during reconstruction.
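
One way to picture a two-part objective of this kind is sketched below. The summary does not give the actual loss, so everything here is an assumption made for illustration: the cosine-based `context_loss`, the `context_target` vector standing in for some broader-context signal, and the `weight` parameter are hypothetical and should not be read as the authors' formulation.

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over masked tokens only (standard MIM term)."""
    per_token = ((pred - target) ** 2).mean(axis=1)
    return float(per_token[mask].mean())

def context_loss(unmasked_emb, context_target):
    """Hypothetical context term: 1 - cosine similarity between each
    unmasked token embedding and a single context vector, averaged."""
    emb = unmasked_emb / np.linalg.norm(unmasked_emb, axis=1, keepdims=True)
    tgt = context_target / np.linalg.norm(context_target)
    return float((1.0 - emb @ tgt).mean())

def total_loss(pred, target, mask, unmasked_emb, context_target, weight=1.0):
    """Reconstruction term plus a weighted (assumed) context term."""
    recon = masked_reconstruction_loss(pred, target, mask)
    return recon + weight * context_loss(unmasked_emb, context_target)

# Toy tensors: 8 tokens (6 masked, 2 visible), assumed dimensions.
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 6))
pred = target + 0.1                      # decoder output, off by 0.1 everywhere
mask = np.array([True] * 6 + [False] * 2)
unmasked_emb = rng.normal(size=(2, 5))   # encoder outputs for visible tokens
context_target = rng.normal(size=5)      # stand-in for a broader-context signal
print(total_loss(pred, target, mask, unmasked_emb, context_target))
```

The point of the sketch is only the structure: the reconstruction term touches masked positions, while the extra term supervises the unmasked token embeddings directly.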

The researchers hypothesize that this broader context supervision helps address the limited attention span often observed in MIM-trained encoders, which may be due to the sole focus on regressing the masked tokens.
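
The summary does not define how "attention span" is measured. A common proxy in the Vision Transformer literature is mean attention distance: the spatial distance between a query patch and the patches it attends to, weighted by the attention weights. The sketch below assumes a square patch grid with no class token; `mean_attention_distance` is a name chosen here, not one from the paper.

```python
import numpy as np

def mean_attention_distance(attn, grid_size):
    """Average attention-weighted distance, in patch units.

    attn: (num_tokens, num_tokens) attention matrix whose rows sum to 1,
          with tokens laid out row-major on a grid_size x grid_size grid.
    """
    rows, cols = np.divmod(np.arange(grid_size * grid_size), grid_size)
    coords = np.stack([rows, cols], axis=1).astype(np.float64)
    # Pairwise Euclidean distance between patch positions.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return float((attn * dist).sum(axis=1).mean())

# Two extreme heads on a 2 x 2 grid (toy example).
local = np.eye(4)                 # each token attends only to itself
uniform = np.full((4, 4), 0.25)   # each token attends equally to all tokens
print(mean_attention_distance(local, 2), mean_attention_distance(uniform, 2))
```

Under this proxy, a head with a short attention span scores close to zero, and a head that spreads attention over the whole image scores higher.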

Experiments show that this simple modification improves ImageNet-1K classification (84.2% top-1 accuracy with ViT-B, a gain of 0.6 percentage points) and yields gains on downstream semantic segmentation, fine-grained visual classification, and diverse robustness evaluations. The authors attribute the improvement to the enhanced pre-training method, citing singular value spectrum and attention analyses as evidence of more discriminative representations.
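
As a rough illustration of what a singular value spectrum analysis looks at, the sketch below computes the normalized singular values of a feature matrix together with an entropy-based effective rank. The exact protocol used in the paper is not stated in this summary, so the centering step, the normalization, and the `effective_rank` summary statistic are assumptions.

```python
import numpy as np

def singular_values(features):
    """Singular values of the mean-centered (num_samples, dim) feature matrix."""
    centered = features - features.mean(axis=0, keepdims=True)
    return np.linalg.svd(centered, compute_uv=False)

def normalized_spectrum(features):
    """Singular values scaled so that the largest equals 1."""
    s = singular_values(features)
    return s / s[0]

def effective_rank(features):
    """exp(entropy) of the singular value distribution; higher means the
    features spread over more directions."""
    s = singular_values(features)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
spread = rng.normal(size=(200, 8))                              # uses all 8 dims
collapsed = np.outer(rng.normal(size=200), rng.normal(size=8))  # rank-1 features
print(effective_rank(spread), effective_rank(collapsed))
```

A spectrum that decays slowly is commonly read as a sign that the representation uses many directions of the feature space, whereas a sharp drop suggests partially collapsed features.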

Critical Analysis

The researchers acknowledge that their method only provides a partial solution to the limited attention span issue in MIM. While the broader context supervision helps, there may be other factors contributing to this problem that are not addressed.

Additionally, the potential trade-offs of the proposed approach are not clear from the abstract. It reports gains on diverse robustness evaluation metrics, but it does not say whether the enhanced contextual understanding comes at the cost of other desirable properties, such as pre-training efficiency.

Further research could delve into the interplay between different objectives and design choices in MIM pretraining, and investigate more holistic approaches to improving the learned representations. Exploring the generalizability of the findings to other MIM variants and downstream tasks would also be valuable.

Overall, the work presents a promising step towards enhancing MIM-based representation learning, but there remains room for further exploration and refinement of these techniques.

Conclusion

The researchers have proposed a simple yet effective modification to the standard masked image modeling (MIM) approach, which aims to address the limited attention span often observed in MIM-trained encoders.

By explicitly incorporating the unmasked tokens into the training process, the encoder is able to learn from broader contextual information, leading to more discriminative representations. This is evidenced by the significant performance gains on ImageNet classification and downstream tasks like semantic segmentation and fine-grained visual classification.

The work highlights the importance of considering the broader context in self-supervised representation learning, and suggests that further research in this direction could yield important insights and advancements in the field of computer vision.



Related Papers


Learning with Unmasked Tokens Drives Stronger Vision Learners

Taekyung Kim, Sanghyuk Chun, Byeongho Heo, Dongyoon Han

Masked image modeling (MIM) has become a leading self-supervised learning strategy. MIMs such as Masked Autoencoder (MAE) learn strong representations by randomly masking input tokens for the encoder to process, with the decoder reconstructing the masked tokens to the input. However, MIM pre-trained encoders often exhibit a limited attention span, attributed to MIM's sole focus on regressing masked tokens only, which may impede the encoder's broader context learning. To tackle the limitation, we improve MIM by explicitly incorporating unmasked tokens into the training process. Specifically, our method enables the encoder to learn from broader context supervision, allowing unmasked tokens to experience broader contexts while the decoder reconstructs masked tokens. Thus, the encoded unmasked tokens are equipped with extensive contextual information, empowering masked tokens to leverage the enhanced unmasked tokens for MIM. As a result, our simple remedy trains more discriminative representations revealed by achieving 84.2% top-1 accuracy with ViT-B on ImageNet-1K with 0.6%p gain. We attribute the success to the enhanced pre-training method, as evidenced by the singular value spectrum and attention analyses. Finally, our models achieve significant performance gains at the downstream semantic segmentation and fine-grained visual classification tasks; and on diverse robust evaluation metrics. Code is available at https://github.com/naver-ai/lut

8/27/2024

Emerging Property of Masked Token for Effective Pre-training

Hyesong Choi, Hunsang Lee, Seyoung Joung, Hyejin Park, Jiyeong Kim, Dongbo Min

Driven by the success of Masked Language Modeling (MLM), the realm of self-supervised learning for computer vision has been invigorated by the central role of Masked Image Modeling (MIM) in driving recent breakthroughs. Notwithstanding the achievements of MIM across various downstream tasks, its overall efficiency is occasionally hampered by the lengthy duration of the pre-training phase. This paper presents the optimization of masked tokens as a means of addressing the prevailing issue. Initially, we delve into an exploration of the inherent properties that a masked token ought to possess. Among these properties, we principally focus on articulating and emphasizing the 'data singularity' attribute inherent in masked tokens. Through a comprehensive analysis of the heterogeneity between masked tokens and visible tokens within pre-trained models, we propose a novel approach termed masked token optimization (MTO), specifically designed to improve model efficiency through weight recalibration and the enhancement of the key property of masked tokens. The proposed method serves as an adaptable solution that seamlessly integrates into any MIM approach that leverages masked tokens. As a result, MTO achieves a considerable improvement in pre-training efficiency, resulting in an approximately 50% reduction in pre-training epochs required to attain converged performance of the recent approaches.

4/15/2024

Morphing Tokens Draw Strong Masked Image Models

Taekyung Kim, Byeongho Heo, Dongyoon Han

Masked image modeling (MIM) is a promising option for training Vision Transformers among various self-supervised learning (SSL) methods. The essence of MIM lies in token-wise masked token predictions, with targets tokenized from images or generated by pre-trained models such as vision-language models. While tokenizers or pre-trained models are plausible MIM targets, they often offer spatially inconsistent targets even for neighboring tokens, complicating models to learn unified discriminative representations. Our pilot study confirms that addressing spatial inconsistencies has the potential to enhance representation quality. Motivated by the findings, we introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets. DTM is compatible with various SSL frameworks; we showcase an improved MIM by employing DTM, barely introducing extra training costs. Our experiments on ImageNet-1K and ADE20K demonstrate the superiority of our methods compared with state-of-the-art, complex MIM methods. Furthermore, the comparative evaluation of the iNaturalists and fine-grained visual classification datasets further validates the transferability of our method on various downstream tasks. Code is available at https://github.com/naver-ai/dtm

5/3/2024

Masked Image Modeling: A Survey

Vlad Hondru, Florinel Alin Croitoru, Shervin Minaee, Radu Tudor Ionescu, Nicu Sebe

In this work, we survey recent studies on masked image modeling (MIM), an approach that emerged as a powerful self-supervised learning technique in computer vision. The MIM task involves masking some information, e.g. pixels, patches, or even latent representations, and training a model, usually an autoencoder, to predict the missing information by using the context available in the visible part of the input. We identify and formalize two categories of approaches on how to implement MIM as a pretext task, one based on reconstruction and one based on contrastive learning. Then, we construct a taxonomy and review the most prominent papers in recent years. We complement the manually constructed taxonomy with a dendrogram obtained by applying a hierarchical clustering algorithm. We further identify relevant clusters via manually inspecting the resulting dendrogram. Our review also includes datasets that are commonly used in MIM research. We aggregate the performance results of various masked image modeling methods on the most popular datasets, to facilitate the comparison of competing methods. Finally, we identify research gaps and propose several interesting directions of future work.

8/14/2024