Efficient Masked Autoencoders with Self-Consistency

2302.14431

Published 6/4/2024 by Zhaowen Li, Yousong Zhu, Zhiyang Chen, Wei Li, Chaoyang Zhao, Rui Zhao, Ming Tang, Jinqiao Wang

Abstract

Inspired by the masked language modeling (MLM) in natural language processing tasks, the masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision. However, the high random mask ratio of MIM results in two serious problems: 1) the inadequate data utilization of images within each iteration brings prolonged pre-training, and 2) the high inconsistency of predictions results in unreliable generations, i.e., the prediction of the identical patch may be inconsistent in different mask rounds, leading to divergent semantics in the ultimately generated outcomes. To tackle these problems, we propose the efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency and increase the consistency of MIM. In particular, we present a parallel mask strategy that divides the image into K non-overlapping parts, each of which is generated by a random mask with the same mask ratio. Then the MIM task is conducted parallelly on all parts in an iteration and the model minimizes the loss between the predictions and the masked patches. Besides, we design the self-consistency learning to further maintain the consistency of predictions of overlapping masked patches among parts. Overall, our method is able to exploit the data more efficiently and obtains reliable representations. Experiments on ImageNet show that EMAE achieves the best performance on ViT-Large with only 13% of MAE pre-training time using NVIDIA A100 GPUs. After pre-training on diverse datasets, EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.

Overview

Introduces a new approach called "Efficient Masked Autoencoders with Self-Consistency" (EMAE) to improve the efficiency and consistency of masked image modeling (MIM), a self-supervised pre-training method in computer vision.
Addresses two key problems with the high random mask ratio in MIM: prolonged pre-training due to inadequate data utilization, and unreliable generations due to inconsistent predictions of identical patches.
EMAE proposes a parallel mask strategy and self-consistency learning to tackle these issues.

Plain English Explanation

The paper introduces a new technique called EMAE that builds on the masked image modeling (MIM) approach used in computer vision. MIM is a powerful self-supervised pre-training method, where the model tries to predict the content of "masked" or hidden parts of an image.

However, the high random mask ratio in MIM can lead to two problems:

The model doesn't get to see enough of the image in each training iteration, which means the pre-training process takes a long time.
The model's prediction for the same hidden patch can differ from one random mask to the next, so its reconstructions of the image are unreliable and can diverge in meaning.

To address these issues, EMAE uses a parallel mask strategy: the image is divided into K non-overlapping parts, each produced by a random mask with the same mask ratio. The model then runs the MIM task on all K parts within the same iteration, so it uses more of each image per training step and pre-training becomes more efficient.
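The partitioning step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `parallel_masks`, the 196-patch grid (a 14 x 14 layout), and the choice of K = 4 parts (which gives each part a 75% mask ratio) are assumptions made for the example.

```python
import numpy as np


def parallel_masks(num_patches: int, num_parts: int, rng: np.random.Generator) -> np.ndarray:
    """Return a boolean array of shape (num_parts, num_patches); True marks a visible patch.

    Hypothetical helper: each patch is visible in exactly one part, so the
    visible sets of the parts do not overlap and together cover the image.
    """
    if num_patches % num_parts != 0:
        raise ValueError("num_patches must be divisible by num_parts")
    order = rng.permutation(num_patches)       # random patch order
    groups = order.reshape(num_parts, -1)      # K disjoint groups of patch indices
    visible = np.zeros((num_parts, num_patches), dtype=bool)
    visible[np.arange(num_parts)[:, None], groups] = True
    return visible


rng = np.random.default_rng(0)
visible = parallel_masks(num_patches=196, num_parts=4, rng=rng)  # assumed sizes
mask_ratio = 1.0 - visible.sum(axis=1) / visible.shape[1]        # per-part mask ratio
```

With these assumed sizes, every part hides the same fraction of patches, yet across the four parts each patch is seen by the encoder once per iteration.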

Additionally, EMAE introduces self-consistency learning, which helps the model maintain consistency in its predictions of overlapping masked patches across the different image parts. This results in more reliable and coherent image generations.
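A rough sketch of what such a consistency term could look like is given below. It assumes that "overlapping masked patches" means patches hidden in both parts of a pair, and it uses a mean squared difference as the disagreement measure; the summary does not specify the exact distance, so both choices, along with the function name `self_consistency_loss` and the toy sizes, are assumptions for illustration.

```python
from itertools import combinations

import numpy as np


def self_consistency_loss(preds: np.ndarray, visible: np.ndarray) -> float:
    """Average disagreement between parts on patches that both parts had masked.

    preds:   (K, N, D) predicted patch content from each of the K parts.
    visible: (K, N) boolean, True where a patch was visible to that part.
    """
    pair_losses = []
    for i, j in combinations(range(preds.shape[0]), 2):
        both_masked = ~visible[i] & ~visible[j]   # patches hidden in both parts
        if both_masked.any():
            diff = preds[i, both_masked] - preds[j, both_masked]
            pair_losses.append(np.mean(diff ** 2))
    return float(np.mean(pair_losses)) if pair_losses else 0.0


# Toy setup (assumed sizes): K parts, N patches, D values per patch.
rng = np.random.default_rng(0)
K, N, D = 4, 8, 3
groups = rng.permutation(N).reshape(K, -1)
visible = np.zeros((K, N), dtype=bool)
visible[np.arange(K)[:, None], groups] = True

preds = rng.normal(size=(K, N, D))
loss_random = self_consistency_loss(preds, visible)
# If every part predicts exactly the same content, the term vanishes.
loss_agree = self_consistency_loss(np.repeat(preds[:1], K, axis=0), visible)
```

The term is zero only when all parts agree on the patches they had to guess, which is the behaviour the self-consistency objective is meant to encourage.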

By tackling these problems, EMAE reaches state-of-the-art transfer performance on a variety of computer vision tasks, such as image classification, object detection, and semantic segmentation, while needing far less pre-training: the abstract reports the best ImageNet result on ViT-Large with only 13% of MAE's pre-training time.

Technical Explanation

The paper presents the Efficient Masked Autoencoders with Self-Consistency (EMAE) approach to improve the pre-training efficiency and consistency of masked image modeling (MIM), a self-supervised learning method in computer vision.

The key elements of EMAE are:

Parallel Mask Strategy: The image is divided into K non-overlapping parts, each generated by a random mask with the same mask ratio. The MIM task is then performed in parallel on all K parts within one iteration, with the model minimizing the loss between its predictions and the masked patches. Because every part is used in the same iteration, the data is exploited more efficiently than with a single high-ratio random mask.
Self-Consistency Learning: Patches that are masked in more than one part are predicted several times, once per part. The model is additionally trained to keep these predictions consistent with one another, so that the same patch is not reconstructed differently under different masks. This helps ensure reliable and coherent reconstructions.
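One way to write the two training signals down is shown below. The notation is illustrative and not taken from the paper: $M_k$ is the set of patches masked in part $k$, $x_p$ is the target for patch $p$, $\hat{x}^{(k)}_p$ is the prediction for that patch made from part $k$, and the squared error and the weight $\lambda$ are assumptions.

```latex
% Reconstruction loss, averaged over the K parts (assumed squared error)
\mathcal{L}_{\mathrm{rec}}
  = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|M_k|}
    \sum_{p \in M_k} \bigl\lVert \hat{x}^{(k)}_p - x_p \bigr\rVert_2^2

% Self-consistency loss over patches masked in both parts of a pair
\mathcal{L}_{\mathrm{sc}}
  = \frac{2}{K(K-1)} \sum_{1 \le i < j \le K} \frac{1}{|M_i \cap M_j|}
    \sum_{p \in M_i \cap M_j}
    \bigl\lVert \hat{x}^{(i)}_p - \hat{x}^{(j)}_p \bigr\rVert_2^2

% Combined objective with an assumed weighting factor \lambda
\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda \, \mathcal{L}_{\mathrm{sc}}
```

The first term compares each prediction with the ground-truth patch, while the second compares predictions with each other, which is why the two elements address different problems.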

The authors evaluate EMAE on the ImageNet dataset and report that it achieves the best performance on the ViT-Large model with only 13% of the pre-training time required by MAE, measured on NVIDIA A100 GPUs.

After pre-training on diverse datasets, EMAE is able to consistently obtain state-of-the-art transfer learning performance on a variety of downstream tasks, including image classification, object detection, and semantic segmentation.

Critical Analysis

The paper presents a compelling approach to address the key limitations of the high random mask ratio in MIM pre-training. The parallel mask strategy and self-consistency learning effectively tackle the issues of prolonged pre-training and unreliable generations, respectively.

One potential limitation not discussed in the paper is the scalability of the parallel mask approach. While it improves efficiency for a single image, it's unclear how well it would scale to larger batches or higher-resolution images, which are common in real-world computer vision tasks.

Additionally, the paper does not provide a detailed analysis of the trade-offs between the increased computational complexity of the parallel mask strategy and the gains in pre-training efficiency. This could be an area for further investigation.

Another aspect that could be explored is the impact of the self-consistency learning on the learned representations. While the paper shows improved performance on downstream tasks, a deeper examination of the latent space and its properties could shed light on the benefits of this approach compared to standard MIM.

Despite these minor limitations, the EMAE technique represents a significant contribution to the field of self-supervised learning in computer vision, and the insights gained could inspire further advancements in masked autoencoders, self-supervised vision transformers, and other related areas.

Conclusion

The paper introduces the Efficient Masked Autoencoders with Self-Consistency (EMAE) approach, which addresses key limitations of the high random mask ratio in the popular masked image modeling (MIM) self-supervised pre-training method for computer vision.

By proposing a parallel mask strategy and self-consistency learning, EMAE is able to significantly improve the pre-training efficiency and the consistency of the model's predictions, leading to state-of-the-art performance on a variety of downstream tasks with much less pre-training time.

This work represents an important step forward in advancing self-supervised learning techniques for computer vision, and the insights gained could have broader implications for interpretable and performant self-supervised models as well as sensor-agnostic image retrieval.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?

Kevin Zhang, Zhiqiang Shen

Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain. However, the mechanism and properties of the learned representations by such a scheme, as well as how to further enhance the representations are so far not well-explored. In this paper, we aim to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability from two aspects: (1) employing a two-way image reconstruction and a latent feature reconstruction with distillation loss to learn better features; (2) proposing a semantics-enhanced sampling strategy to boost the learned semantics in MAE. Upon the proposed i-MAE architecture, we can address two critical questions to explore the behaviors of the learned representations in MAE: (1) Whether the separability of latent representations in Masked Autoencoders is helpful for model performance? We study it by forcing the input as a mixture of two images instead of one. (2) Whether we can enhance the representations in the latent feature space by controlling the degree of semantics during sampling on Masked Autoencoders? To this end, we propose a sampling strategy within a mini-batch based on the semantics of training samples to examine this aspect. Extensive experiments are conducted on CIFAR-10/100, Tiny-ImageNet and ImageNet-1K to verify the observations we discovered. Furthermore, in addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space by proposing two evaluation schemes. The surprising and consistent results demonstrate that i-MAE is a superior framework design for understanding MAE frameworks, as well as achieving better representational ability. Code is available at https://github.com/vision-learning-acceleration-lab/i-mae.

4/10/2024

cs.CV cs.AI cs.LG

Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation

Pengfei Gu, Yejia Zhang, Huimin Li, Hongxiao Wang, Yizhe Zhang, Chaoli Wang, Danny Z. Chen

Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel information in visible patches, a ViT encoder can aggregate contextual information for downstream tasks. But, existing MAE pre-training methods, which were specifically developed with the ViT architecture, lack the ability to capture geometric shape and spatial information, which is critical for medical image segmentation tasks. In this paper, we propose a novel extension of known MAEs for self pre-training (i.e., models pre-trained on the same target dataset) for 3D medical image segmentation. (1) We propose a new topological loss to preserve geometric shape information by computing topological signatures of both the input and reconstructed volumes, learning geometric shape information. (2) We introduce a pre-text task that predicts the positions of the centers and eight corners of 3D crops, enabling the MAE to aggregate spatial information. (3) We extend the MAE pre-training strategy to a hybrid state-of-the-art (SOTA) medical image segmentation architecture and co-pretrain it alongside the ViT. (4) We develop a fine-tuned model for downstream segmentation tasks by complementing the pre-trained ViT encoder with our pre-trained SOTA model. Extensive experiments on five public 3D segmentation datasets show the effectiveness of our new approach.

6/18/2024

cs.CV cs.AI

SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation

Kejia Yin, Varshanth R. Rao, Ruowei Jiang, Xudong Liu, Parham Aarabi, David B. Lindell

Self-supervised landmark estimation is a challenging task that demands the formation of locally distinct feature representations to identify sparse facial landmarks in the absence of annotated data. To tackle this task, existing state-of-the-art (SOTA) methods (1) extract coarse features from backbones that are trained with instance-level self-supervised learning (SSL) paradigms, which neglect the dense prediction nature of the task, (2) aggregate them into memory-intensive hypercolumn formations, and (3) supervise lightweight projector networks to naively establish full local correspondences among all pairs of spatial features. In this paper, we introduce SCE-MAE, a framework that (1) leverages the MAE, a region-level SSL method that naturally better suits the landmark prediction task, (2) operates on the vanilla feature map instead of on expensive hypercolumns, and (3) employs a Correspondence Approximation and Refinement Block (CARB) that utilizes a simple density peak clustering algorithm and our proposed Locality-Constrained Repellence Loss to directly hone only select local correspondences. We demonstrate through extensive experiments that SCE-MAE is highly effective and robust, outperforming existing SOTA methods by large margins of approximately 20%-44% on the landmark matching and approximately 9%-15% on the landmark detection tasks.

5/29/2024

cs.CV cs.AI

Visualizing the loss landscape of Self-supervised Vision Transformer

Youngwan Lee, Jeffrey Ryan Willette, Jonghee Kim, Sung Ju Hwang

The Masked autoencoder (MAE) has drawn attention as a representative self-supervised approach for masked image modeling with vision transformers. However, even though MAE shows better generalization capability than fully supervised training from scratch, the reason why has not been explored. In another line of work, the Reconstruction Consistent Masked Auto Encoder (RC-MAE), has been proposed which adopts a self-distillation scheme in the form of an exponential moving average (EMA) teacher into MAE, and it has been shown that the EMA-teacher performs a conditional gradient correction during optimization. To further investigate the reason for better generalization of the self-supervised ViT when trained by MAE (MAE-ViT) and the effect of the gradient correction of RC-MAE from the perspective of optimization, we visualize the loss landscapes of the self-supervised vision transformer by both MAE and RC-MAE and compare them with the supervised ViT (Sup-ViT). Unlike previous loss landscape visualizations of neural networks based on classification task loss, we visualize the loss landscape of ViT by computing pre-training task loss. Through the lens of loss landscapes, we find two interesting observations: (1) MAE-ViT has a smoother and wider overall loss curvature than Sup-ViT. (2) The EMA-teacher allows MAE to widen the region of convexity in both pretraining and linear probing, leading to quicker convergence. To the best of our knowledge, this work is the first to investigate the self-supervised ViT through the lens of the loss landscape.

5/29/2024

cs.CV cs.LG