CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder

Read original: arXiv:2406.05773 - Published 6/11/2024 by Tangfei Liao, Xiaoqin Zhang, Guobao Xiao, Min Li, Tao Wang, Mang Ye

Overview

CorrMAE is a pre-training method that adapts the Masked Autoencoder (MAE) framework to correspondence pruning, the task of separating correct point matches (inliers) between two images from incorrect ones (outliers).
The key idea is to mask part of a set of correspondences and have the model reconstruct the masked ones from the consistency it learns among the visible ones.
Because a modest number of true correspondences serves as the pre-training input, the approach aims to keep pre-training overhead low while providing a strong initial representation for downstream tasks.

Plain English Explanation

In CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder, the researchers developed a new way to pre-train AI models that decide which point matches between two images of the same scene are correct.

When keypoints in one image are matched to keypoints in another, many of the resulting matches are wrong. Correspondence pruning is the step that filters out the wrong ones. The main idea in CorrMAE is to hide, or "mask", some of the true matches and have the model fill in the missing matching points, using the pattern it sees in the matches that remain visible. To do this well, the model has to learn what a consistent set of correct matches looks like.

This hide-and-reconstruct strategy comes from the "Masked Autoencoder" (MAE) framework, which is here adapted from image patches to sets of point matches. The researchers report that a model pre-trained this way gives a stronger starting point for downstream correspondence pruning than training from scratch.

Technical Explanation

In the CorrMAE paper, the authors propose a pre-training method that extends the Masked Autoencoder (MAE) framework to correspondence pruning. The goal is a generic, inlier-consistent representation that can be reused by downstream tasks.

The core idea is to mask part of the input correspondences and have the model reconstruct them. Reconstruction is guided by the consistency learned among the visible correspondences, so the pretext task pushes the model toward representations in which inliers agree with one another. A modest number of true correspondences serves as input, which the authors say significantly reduces pre-training overhead.
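To make the masking step concrete, here is a minimal sketch in plain Python. It rests on assumptions that are not taken from the paper: each correspondence is represented as a tuple `(x1, y1, x2, y2)` (a point in the first image and its match in the second), masking hides only the second-image point, and the function name `mask_correspondences` and the `mask_ratio` value are hypothetical.

```python
import random

def mask_correspondences(corrs, mask_ratio=0.5, seed=0):
    """Split correspondences into a visible set and a masked set.

    corrs: list of (x1, y1, x2, y2) tuples, one per true match (assumed layout).
    Returns (visible, masked_sources, masked_targets), where masked_sources
    are the first-image points the model still sees and masked_targets are
    the hidden second-image points it would have to reconstruct.
    """
    rng = random.Random(seed)
    n_masked = int(round(len(corrs) * mask_ratio))
    masked_idx = set(rng.sample(range(len(corrs)), n_masked))

    visible = [c for i, c in enumerate(corrs) if i not in masked_idx]
    masked_sources = [(c[0], c[1]) for i, c in enumerate(corrs) if i in masked_idx]
    masked_targets = [(c[2], c[3]) for i, c in enumerate(corrs) if i in masked_idx]
    return visible, masked_sources, masked_targets

# Toy data: ten matches that all follow the same small shift.
corrs = [(0.1 * i, 0.2 * i, 0.1 * i + 0.05, 0.2 * i - 0.03) for i in range(10)]
visible, sources, targets = mask_correspondences(corrs, mask_ratio=0.6)
```

With ten correspondences and `mask_ratio=0.6`, six are hidden and four stay visible. A real pipeline would feed `visible` and `sources` to the network and compare its output against `targets`.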

As the authors describe it, CorrMAE involves two main phases: correspondence learning and matching point reconstruction. A bi-level encoder handles correspondence learning and is reported to improve consistency learning and transferability. A dual-branch structure with a dedicated positional encoding handles reconstruction, because correspondences, unlike image patches on a grid, are unordered and irregularly distributed.
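A reconstruction objective for matching points can be sketched as a distance between predicted and true point locations, computed only over the masked correspondences. The summary does not state which loss the paper uses, so the mean squared error below is an assumption, and `reconstruction_loss` is a hypothetical name.

```python
def reconstruction_loss(predicted, target):
    """Mean squared Euclidean distance between predicted and true 2D points.

    predicted, target: equal-length lists of (x, y) tuples for the masked
    correspondences only. MSE is an assumed choice, not the paper's stated loss.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        return 0.0
    total = sum(
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(predicted, target)
    )
    return total / len(predicted)

# One exact prediction and one that is off by 2 along y.
loss = reconstruction_loss([(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 3.0)])
```

Restricting the loss to masked points follows the general MAE recipe, where visible inputs are not scored because the model already sees them.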

The authors report that models pre-trained with CorrMAE outperform prior work on multiple challenging benchmarks. They also characterize CorrMAE as primarily a task-driven pre-training method: pre-training on the targeted dataset yields notable improvements on the downstream task.

Critical Analysis

The CorrMAE paper addresses a gap the authors identify: pre-training has not been extensively studied in correspondence pruning, which they attribute to expensive training cost and limited data.

One potential limitation is the reliance on true correspondences as pre-training input. If those correspondences are noisy or scarce for a given dataset, the learned representation and its transferability to downstream tasks could suffer.

Additionally, the authors describe CorrMAE as primarily task-driven, with gains coming from pre-training on the targeted dataset. How well a single pre-trained model transfers across datasets or scene types is a question readers should check against the full paper.

Overall, the CorrMAE paper presents an interesting and potentially valuable contribution to correspondence pruning. However, as with any research, it is important to consider the potential limitations and areas for further exploration.

Conclusion

The CorrMAE paper introduces a pre-training approach that extends the Masked Autoencoder (MAE) framework to correspondence pruning. By training the model to reconstruct masked correspondences from the consistency among visible ones, CorrMAE aims to learn an inlier-consistent representation that serves as a strong initialization for downstream tasks.

The authors report that models pre-trained with CorrMAE outperform prior work on multiple challenging benchmarks, and they present the method as a starting point for further work on pre-training in correspondence pruning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder

Tangfei Liao, Xiaoqin Zhang, Guobao Xiao, Min Li, Tao Wang, Mang Ye

Pre-training has emerged as a simple yet powerful methodology for representation learning across various domains. However, due to the expensive training cost and limited data, pre-training has not yet been extensively studied in correspondence pruning. To tackle these challenges, we propose a pre-training method to acquire a generic inliers-consistent representation by reconstructing masked correspondences, providing a strong initial representation for downstream tasks. Toward this objective, a modicum of true correspondences naturally serve as input, thus significantly reducing pre-training overhead. In practice, we introduce CorrMAE, an extension of the mask autoencoder framework tailored for the pre-training of correspondence pruning. CorrMAE involves two main phases, i.e., correspondence learning and matching point reconstruction, guiding the reconstruction of masked correspondences through learning visible correspondence consistency. Herein, we employ a dual-branch structure with an ingenious positional encoding to reconstruct unordered and irregular correspondences. Also, a bi-level designed encoder is proposed for correspondence learning, which offers enhanced consistency learning capability and transferability. Extensive experiments have shown that the model pre-trained with our CorrMAE outperforms prior work on multiple challenging benchmarks. Meanwhile, our CorrMAE is primarily a task-driven pre-training method, and can achieve notable improvements for downstream tasks by pre-training on the targeted dataset. We hope this work can provide a starting point for correspondence pruning pre-training.

6/11/2024

SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation

Kejia Yin, Varshanth R. Rao, Ruowei Jiang, Xudong Liu, Parham Aarabi, David B. Lindell

Self-supervised landmark estimation is a challenging task that demands the formation of locally distinct feature representations to identify sparse facial landmarks in the absence of annotated data. To tackle this task, existing state-of-the-art (SOTA) methods (1) extract coarse features from backbones that are trained with instance-level self-supervised learning (SSL) paradigms, which neglect the dense prediction nature of the task, (2) aggregate them into memory-intensive hypercolumn formations, and (3) supervise lightweight projector networks to naively establish full local correspondences among all pairs of spatial features. In this paper, we introduce SCE-MAE, a framework that (1) leverages the MAE, a region-level SSL method that naturally better suits the landmark prediction task, (2) operates on the vanilla feature map instead of on expensive hypercolumns, and (3) employs a Correspondence Approximation and Refinement Block (CARB) that utilizes a simple density peak clustering algorithm and our proposed Locality-Constrained Repellence Loss to directly hone only select local correspondences. We demonstrate through extensive experiments that SCE-MAE is highly effective and robust, outperforming existing SOTA methods by large margins of approximately 20%-44% on the landmark matching and approximately 9%-15% on the landmark detection tasks.

5/29/2024

Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

Alexandre Eymael, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck

Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE. Our method specifically differs by exclusively considering pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performances and drastically reducing pre-training and learning time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods do not learn such representations from explicit object motion, but rather thanks to the implicit image transformations that occur between the two views. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches. Our code is available at https://github.com/alexandre-eymael/CropMAE.

7/19/2024

3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining

Siming Yan, Yuqi Yang, Yuxiao Guo, Hao Pan, Peng-shuai Wang, Xin Tong, Yang Liu, Qixing Huang

Masked autoencoders (MAE) have recently been introduced to 3D self-supervised pretraining for point clouds due to their great success in NLP and computer vision. Unlike MAEs used in the image domain, where the pretext task is to restore features at the masked pixels, such as colors, the existing 3D MAE works reconstruct the missing geometry only, i.e., the location of the masked points. In contrast to previous studies, we advocate that point location recovery is inessential and restoring intrinsic point features is much superior. To this end, we propose to ignore point position reconstruction and recover high-order features at masked points including surface normals and surface variations, through a novel attention-based decoder which is independent of the encoder design. We validate the effectiveness of our pretext task and decoder design using different encoder structures for 3D training and demonstrate the advantages of our pretrained networks on various point cloud analysis tasks.

4/30/2024