Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

Read original: arXiv:2403.17823 - Published 7/19/2024 by Alexandre Eymael, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck

Overview

This paper introduces CropMAE, an efficient self-supervised image pre-training method based on Siamese cropped masked autoencoders.
CropMAE builds on Masked Autoencoders (MAE) and on SiamMAE, replacing SiamMAE's pairs of video frames with pairs of crops taken from the same image.
The goal is to learn object-centric visual representations without video data, for use in downstream tasks like video segmentation and label propagation.

Plain English Explanation

The researchers have developed a new way to pre-train image models called CropMAE (Siamese Cropped Masked Autoencoders). It builds on previous work on Masked Autoencoders, which train models to fill in missing parts of images.

CropMAE uses a Siamese network architecture, which means a single encoder with shared weights processes two views in parallel. Earlier work (SiamMAE) took the two views from different frames of a video; CropMAE instead takes two different crops of the same image. Nearly all patches of one crop are hidden (up to 98.5%, leaving as few as two visible patches), and the model has to reconstruct what is behind the masked areas.

The goal is to learn robust visual representations that can be used as a starting point for other computer vision tasks, like video segmentation and label propagation. Because the method needs only still images, it removes the need for video datasets and drastically reduces pre-training time.

Technical Explanation

The key components of the SCMA method are:

Siamese Network Architecture: A single shared-weight encoder processes both views of an input pair and outputs feature representations. As in SiamMAE, the training signal is the reconstruction of the masked view, not a similarity objective between the two representations.
Cropped Masking Strategy: The two views are crops of the same image, cropped differently, rather than two frames of a video. Patches of one crop are masked at a very high ratio (up to 98.5%, leaving as few as two visible patches), and the model is trained to reconstruct the missing regions. This encourages the model to learn features that capture the overall structure and semantics of the image, rather than just memorizing local textures.
Pre-training and Downstream Use: CropMAE is pre-trained on unlabeled still images with the Siamese cropped masking objective, so no video dataset is needed. The pre-trained encoder is then used for downstream tasks like video segmentation and label propagation.
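
The steps above can be sketched as follows, taking the two views as crops of one image as described in the paper's abstract. This is a minimal illustration, not the authors' implementation: the function names, the 256x256 source size, the 224x224 view size, the 16x16 patch size, and the minimum crop scale are all assumptions chosen for the example. Only the 98.5% masking ratio comes from the abstract.

```python
import random


def random_crop_box(height, width, min_scale, rng):
    """Sample a crop box (top, left, crop_h, crop_w) inside the image.

    Each side of the crop covers at least `min_scale` of the image side.
    """
    crop_h = rng.randint(int(height * min_scale), height)
    crop_w = rng.randint(int(width * min_scale), width)
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return top, left, crop_h, crop_w


def visible_patch_indices(num_patches, mask_ratio, rng):
    """Pick the patch indices that stay visible; all others are masked."""
    num_keep = int(num_patches * (1 - mask_ratio))
    return sorted(rng.sample(range(num_patches), num_keep))


rng = random.Random(0)

# Two differently cropped views of the same (hypothetical) 256x256 image.
view_a = random_crop_box(256, 256, min_scale=0.5, rng=rng)
view_b = random_crop_box(256, 256, min_scale=0.5, rng=rng)

# Assumption: each view is resized to 224x224 and split into 16x16 patches.
num_patches = (224 // 16) ** 2

# Mask the second view at the 98.5% ratio reported in the abstract.
visible_b = visible_patch_indices(num_patches, mask_ratio=0.985, rng=rng)

print("view A box:", view_a)
print("view B box:", view_b)
print("visible patches in view B:", visible_b)
```

With 196 patches per view, this rounding leaves two visible patches in the masked view, which matches the "only two visible patches" figure in the abstract.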

The authors report that CropMAE maintains competitive performance relative to SiamMAE while drastically reducing pre-training time. They also find that it learns similar object-centric representations without explicit motion, which suggests that these representations come from the implicit image transformations between the two views, not from object motion itself.

Critical Analysis

The authors provide a thorough experimental evaluation of CropMAE and compare it with other state-of-the-art self-supervised pre-training methods. However, the paper does not delve into the potential limitations or broader implications of this approach.

One possible concern is the computational and memory overhead of processing two views per training sample. The Siamese encoder shares its weights across both views, so no second set of parameters is trained, and the very high masking ratio leaves only a few visible patches in the masked view. The abstract in fact reports drastically reduced pre-training time. How this cost scales to larger models and datasets is still worth examining.
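
A rough way to gauge this overhead is to count the patch tokens the encoder processes per training pair. The sketch below rests on several assumptions that are not stated in this summary: 196 patches per view (224x224 views, 16x16 patches), one view left unmasked while the other is masked at the 98.5% ratio from the abstract, and both views passing through one shared-weight encoder. The helper name is hypothetical.

```python
def visible_tokens(num_patches: int, mask_ratio: float) -> int:
    """Number of patch tokens that remain visible after masking."""
    return int(num_patches * (1 - mask_ratio))


NUM_PATCHES = 196  # assumption: 224x224 view split into 16x16 patches

reference_tokens = visible_tokens(NUM_PATCHES, 0.0)   # unmasked view
masked_tokens = visible_tokens(NUM_PATCHES, 0.985)    # heavily masked view

# Tokens per pair with asymmetric masking vs. two fully visible views.
siamese_total = reference_tokens + masked_tokens
two_full_views = 2 * NUM_PATCHES

print(siamese_total, two_full_views)
```

Under these assumptions the encoder sees 198 tokens per pair instead of 392, so the second view adds little encoder work. Decoder cost is not counted here.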

Additionally, the paper focuses on the performance of CropMAE on specific downstream tasks like video segmentation and label propagation. It would be valuable to understand how well the learned representations generalize to other computer vision problems, and whether the method is sensitive to the choice of downstream task.

Finally, the paper does not discuss the potential biases or fairness issues that could arise from the self-supervised pre-training process. As these models are increasingly deployed in real-world applications, it is important to consider their societal impact and potential for harmful biases.

Conclusion

CropMAE, the Siamese cropped masked autoencoder proposed in this paper, offers an efficient way to pre-train image models for downstream computer vision tasks. By combining a shared-weight Siamese encoder with pairs of crops from the same image and a very high masking ratio, the method learns object-centric visual representations without video data, for applications like video segmentation and label propagation.

The competitive performance of CropMAE relative to SiamMAE, at a much lower pre-training cost, suggests that the Siamese structure and cropped masking are valuable additions to the toolkit of self-supervised learning. Its results also indicate that explicit object motion is not required to learn object-centric representations.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

Alexandre Eymael, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck

Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE. Our method specifically differs by exclusively considering pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performances and drastically reducing pre-training and learning time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods do not learn such representations from explicit object motion, but rather thanks to the implicit image transformations that occur between the two views. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches. Our code is available at https://github.com/alexandre-eymael/CropMAE.

7/19/2024

Efficient Masked Autoencoders with Self-Consistency

Zhaowen Li, Yousong Zhu, Zhiyang Chen, Wei Li, Chaoyang Zhao, Rui Zhao, Ming Tang, Jinqiao Wang

Inspired by the masked language modeling (MLM) in natural language processing tasks, the masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision. However, the high random mask ratio of MIM results in two serious problems: 1) the inadequate data utilization of images within each iteration brings prolonged pre-training, and 2) the high inconsistency of predictions results in unreliable generations, i.e., the prediction of the identical patch may be inconsistent in different mask rounds, leading to divergent semantics in the ultimately generated outcomes. To tackle these problems, we propose the efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency and increase the consistency of MIM. In particular, we present a parallel mask strategy that divides the image into K non-overlapping parts, each of which is generated by a random mask with the same mask ratio. Then the MIM task is conducted parallelly on all parts in an iteration and the model minimizes the loss between the predictions and the masked patches. Besides, we design the self-consistency learning to further maintain the consistency of predictions of overlapping masked patches among parts. Overall, our method is able to exploit the data more efficiently and obtains reliable representations. Experiments on ImageNet show that EMAE achieves the best performance on ViT-Large with only 13% of MAE pre-training time using NVIDIA A100 GPUs. After pre-training on diverse datasets, EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.

6/4/2024

T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning

Weijie Wei, Fatemeh Karimi Nejadasl, Theo Gevers, Martin R. Oswald

The scarcity of annotated data in LiDAR point cloud understanding hinders effective representation learning. Consequently, scholars have been actively investigating efficacious self-supervised pre-training paradigms. Nevertheless, temporal information, which is inherent in the LiDAR point cloud sequence, is consistently disregarded. To better utilize this property, we propose an effective pre-training strategy, namely Temporal Masked Auto-Encoders (T-MAE), which takes as input temporally adjacent frames and learns temporal dependency. A SiamWCA backbone, containing a Siamese encoder and a windowed cross-attention (WCA) module, is established for the two-frame input. Considering that the movement of an ego-vehicle alters the view of the same instance, temporal modeling also serves as a robust and natural data augmentation, enhancing the comprehension of target objects. SiamWCA is a powerful architecture but heavily relies on annotated data. Our T-MAE pre-training strategy alleviates its demand for annotated data. Comprehensive experiments demonstrate that T-MAE achieves the best performance on both Waymo and ONCE datasets among competitive self-supervised approaches. Codes will be released at https://github.com/codename1995/T-MAE

7/23/2024

Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation

Pengfei Gu, Yejia Zhang, Huimin Li, Chaoli Wang, Danny Z. Chen

Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel information in visible patches, a ViT encoder can aggregate contextual information for downstream tasks. But, existing MAE pre-training methods, which were specifically developed with the ViT architecture, lack the ability to capture geometric shape and spatial information, which is critical for medical image segmentation tasks. In this paper, we propose a novel extension of known MAEs for self pre-training (i.e., models pre-trained on the same target dataset) for 3D medical image segmentation. (1) We propose a new topological loss to preserve geometric shape information by computing topological signatures of both the input and reconstructed volumes, learning geometric shape information. (2) We introduce a pre-text task that predicts the positions of the centers and eight corners of 3D crops, enabling the MAE to aggregate spatial information. (3) We extend the MAE pre-training strategy to a hybrid state-of-the-art (SOTA) medical image segmentation architecture and co-pretrain it alongside the ViT. (4) We develop a fine-tuned model for downstream segmentation tasks by complementing the pre-trained ViT encoder with our pre-trained SOTA model. Extensive experiments on five public 3D segmentation datasets show the effectiveness of our new approach.

7/17/2024