MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

2405.02771

Published 5/7/2024 by Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, Nico Lang

Abstract

The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks including image classification and semantic segmentation. We find that multi-modal pretraining notably improves the linear probing performance, e.g. 4pp on BigEarthNet and 16pp on So2Sat, compared to pretraining on optical satellite images only. We show that this also leads to better label and parameter efficiency which are crucial aspects in global scale applications.

Overview

• This paper, MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning, proposes a novel approach for learning rich representations from multi-modal Earth observation data using self-supervised pretext tasks.

• The authors explore different pretext tasks, such as masked image prediction and modality fusion, to learn generalizable features that can be applied to downstream geospatial tasks.

• The proposed approach, a Multi-Pretext Masked Autoencoder (MP-MAE), is pretrained on a new global multi-modal corpus of 1.2 million locations to learn general-purpose, transferable representations for optical satellite images.

Plain English Explanation

The researchers in this study wanted to find a way to automatically learn useful information from satellite images of the Earth, without the need for extensive manual labeling. They used a technique called "self-supervised learning," where the model learns by solving simulated "puzzles" or "pretext tasks" related to the data, rather than being trained on labeled examples.

Specifically, the researchers explored different pretext tasks, such as predicting missing parts of an image and fusing information from different sensor modalities, to see which ones were most effective at learning valuable representations from Sentinel-2 satellite imagery. Sentinel-2 provides a rich source of data, with both visual (color) information and spectral (wavelength) information about the Earth's surface.

The goal was to create a system that could learn powerful and generalizable features from this multi-modal data, which could then be used to tackle a variety of geospatial tasks, such as land cover classification, change detection, or disaster monitoring. By learning these representations in a self-supervised way, the model could be trained without the need for expensive and time-consuming manual labeling of the satellite images.

Technical Explanation

The authors' Multi-Pretext Masked Autoencoder (MP-MAE) learns transferable representations for optical satellite images through a suite of self-supervised, multi-modal pretext tasks. These include masked image prediction, where the model reconstructs missing parts of an image, and modality fusion, where the model learns to predict one modality given another.
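The masked image prediction task described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 4x4 patch grid, the 0.75 mask ratio, the single scalar per patch, and the function names `random_patch_mask` and `masked_mse` are all assumptions made for the example.

```python
import random

def random_patch_mask(grid_h, grid_w, mask_ratio, seed=None):
    """Return a grid of booleans, True where a patch is hidden from the encoder."""
    rng = random.Random(seed)
    n_patches = grid_h * grid_w
    n_masked = int(round(n_patches * mask_ratio))
    masked = set(rng.sample(range(n_patches), n_masked))
    return [[(r * grid_w + c) in masked for c in range(grid_w)]
            for r in range(grid_h)]

def masked_mse(target, prediction, mask):
    """Mean squared error computed over the masked patches only."""
    errors = [
        (t - p) ** 2
        for t_row, p_row, m_row in zip(target, prediction, mask)
        for t, p, m in zip(t_row, p_row, m_row)
        if m
    ]
    return sum(errors) / len(errors) if errors else 0.0

# Toy data: each patch is reduced to one scalar value (an assumption for brevity).
mask = random_patch_mask(4, 4, mask_ratio=0.75, seed=0)
target = [[float(r + c) for c in range(4)] for r in range(4)]
prediction = [[0.0] * 4 for _ in range(4)]
loss = masked_mse(target, prediction, mask)
```

Scoring only the hidden patches is the usual choice for masked autoencoders, since it forces the model to infer content from context rather than copy what it can already see.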

The model architecture consists of a shared encoder network, along with task-specific heads for the various pretext tasks. The approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE), and the authors evaluate the pretrained model on several downstream tasks, including image classification and semantic segmentation.
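The shared-encoder, per-task-head layout can be sketched as follows. This is a toy under stated assumptions rather than the paper's model: the hand-written `encode` function stands in for the real encoder, the task names `optical` and `elevation` are hypothetical, and the equal task weights are chosen only for illustration.

```python
def encode(pixels):
    """Stand-in shared encoder: mean and range of the visible optical values."""
    mean = sum(pixels) / len(pixels)
    return [mean, max(pixels) - min(pixels)]

def linear_head(weights, bias):
    """Build a task-specific head that maps shared features to one scalar."""
    def head(features):
        return sum(w * f for w, f in zip(weights, features)) + bias
    return head

def multi_pretext_loss(features, heads, targets, task_weights):
    """Weighted sum of squared errors, with one term per pretext task."""
    per_task = {
        name: (head(features) - targets[name]) ** 2
        for name, head in heads.items()
    }
    total = sum(task_weights[name] * value for name, value in per_task.items())
    return total, per_task

# Hypothetical pretext tasks; the names are placeholders, not the paper's list.
heads = {
    "optical": linear_head([1.0, 0.0], 0.0),
    "elevation": linear_head([0.5, 0.5], 1.0),
}
targets = {"optical": 2.0, "elevation": 4.0}
task_weights = {"optical": 1.0, "elevation": 1.0}

features = encode([1.0, 2.0, 3.0])
total, per_task = multi_pretext_loss(features, heads, targets, task_weights)
```

Because every head reads the same features, the gradient of each task's loss would flow into the one encoder during training, which is what lets the auxiliary modalities shape the optical representation.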

The results show that MP-MAE learns transferable representations, outperforming both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. Multi-modal pretraining notably improves linear probing performance, for example by 4 percentage points on BigEarthNet and 16 percentage points on So2Sat compared with pretraining on optical satellite images only, and it also leads to better label and parameter efficiency.
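The abstract reports its gains in terms of linear probing, which means freezing the pretrained encoder and training only a linear classifier on its features. A minimal sketch of that protocol is shown here; the `frozen_encoder` function, the toy inputs, and the learning rate and epoch count are assumptions for illustration and have no connection to the paper's actual setup.

```python
import math

def frozen_encoder(x):
    """Stand-in for a pretrained encoder whose weights are never updated."""
    return [x[0] + x[1], x[0] - x[1]]

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression on frozen features with plain gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # derivative of the log loss with respect to z
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b

def predict(w, b, f):
    """Classify one feature vector with the trained linear probe."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# Toy labelled set: only the probe's weights (w, b) are learned.
inputs = [[0.0, 0.0], [0.2, 0.1], [1.0, 0.9], [0.8, 1.0]]
labels = [0, 0, 1, 1]
features = [frozen_encoder(x) for x in inputs]
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
```

Since the encoder is never updated, the probe's accuracy reflects how linearly separable the pretrained features already are, which is why it is a common measure of representation quality.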

Critical Analysis

The MMEarth paper presents a compelling approach for learning rich representations from multi-modal Earth observation data using self-supervised pretext tasks. The exploration of various pretext tasks and the comparison of their effectiveness on downstream geospatial tasks is a valuable contribution to the field of representation learning for remote sensing.

One potential limitation of the study is that the learned representations target optical satellite images, with Sentinel-2 as the input sensor. While the pretraining corpus pairs data from different modalities and sensors, it would be interesting to see how the approach performs when applied to a broader range of Earth observation inputs, including other satellite sensors or even airborne and drone-based imagery. This could help assess the generalizability of the learned representations across different geospatial data sources.

Additionally, while the paper provides insights into the impact of different pretext tasks, further investigation into the underlying reasons for their performance could lead to a deeper understanding of the representation learning process and inform the design of even more effective pretext tasks for geospatial applications.

Conclusion

The MMEarth paper presents a promising approach for learning powerful and transferable representations from multi-modal Earth observation data using self-supervised pretext tasks. By pretraining on data from multiple modalities paired by location and time, the authors demonstrate the effectiveness of their approach on downstream tasks such as image classification and semantic segmentation.

The exploration of various pretext tasks and the insights gained from their comparative analysis contribute valuable knowledge to the field of representation learning for remote sensing. As the availability and variety of Earth observation data continue to grow, the MMEarth framework and its principles could be applied to a broader range of geospatial applications, potentially leading to more efficient and effective solutions for understanding and monitoring the Earth's environment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A$^{2}$-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder

Lixian Zhang, Yi Zhao, Runmin Dong, Jinxiao Zhang, Shuai Yuan, Shilei Cao, Mengxuan Chen, Juepeng Zheng, Weijia Li, Wei Liu, Wayne Zhang, Litong Feng, Haohuan Fu

Vast amounts of remote sensing (RS) data provide Earth observations across multiple dimensions, encompassing critical spatial, temporal, and spectral information which is essential for addressing global-scale challenges such as land use monitoring, disaster prevention, and environmental change mitigation. Despite various pre-training methods tailored to the characteristics of RS data, a key limitation persists: the inability to effectively integrate spatial, temporal, and spectral information within a single unified model. To unlock the potential of RS data, we construct a Spatial-Temporal-Spectral Structured Dataset (STSSD) characterized by the incorporation of multiple RS sources, diverse coverage, unified locations within image sets, and heterogeneity within images. Building upon this structured dataset, we propose an Anchor-Aware Masked AutoEncoder method (A$^{2}$-MAE), leveraging intrinsic complementary information from the different kinds of images and geo-information to reconstruct the masked patches during the pre-training phase. A$^{2}$-MAE integrates an anchor-aware masking strategy and a geographic encoding module to comprehensively exploit the properties of RS images. Specifically, the proposed anchor-aware masking strategy dynamically adapts the masking process based on the meta-information of a pre-selected anchor image, thereby facilitating the training on images captured by diverse types of RS sources within one model. Furthermore, we propose a geographic encoding method to leverage accurate spatial patterns, enhancing the model generalization capabilities for downstream applications that are generally location-related. Extensive experiments demonstrate our method achieves comprehensive improvements across various downstream tasks compared with existing RS pre-training methods, including image classification, semantic segmentation, and change detection tasks.

6/18/2024

cs.CV

Multi-Label Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining

Yi Wang, Conrad M Albrecht, Xiao Xiang Zhu

Self-supervised pretraining on large-scale satellite data has raised great interest in building Earth observation (EO) foundation models. However, many important resources beyond pure satellite imagery, such as land-cover-land-use products that provide free global semantic information, as well as vision foundation models that hold strong knowledge of the natural world, tend to be overlooked. In this work, we show these free additional resources not only help resolve common contrastive learning bottlenecks, but also significantly boost the efficiency and effectiveness of EO pretraining. Specifically, we first propose soft contrastive learning that optimizes cross-scene soft similarity based on land-cover-generated multi-label supervision, naturally solving the issue of multiple positive samples and too strict positive matching in complex scenes. Second, we explore cross-domain continual pretraining for both multispectral and SAR imagery, building efficient EO foundation models from strongest vision models such as DINOv2. Integrating simple weight-initialization and Siamese masking strategies into our soft contrastive learning framework, we demonstrate impressive continual pretraining performance even when the input channels and modalities are not aligned. Without prohibitive training, we produce multispectral and SAR foundation models that achieve significantly better results in 9 out of 10 downstream tasks than most existing SOTA models. For example, our ResNet50/ViT-S achieve 84.8/85.0 linear probing mAP scores on BigEarthNet-10% which are better than most existing ViT-L models; under the same setting, our ViT-B sets a new record of 86.8 in multispectral, and 82.5 in SAR, the latter even better than many multispectral models. Dataset and models are available at https://github.com/zhu-xlab/softcon.

6/3/2024

cs.CV

Self-supervised Pre-training for Transferable Multi-modal Perception

Xiaohao Xu, Tianyi Zhang, Jinrong Yang, Matthew Johnson-Roberson, Xiaonan Huang

In autonomous driving, multi-modal perception models leveraging inputs from multiple sensors exhibit strong robustness in degraded environments. However, these models face challenges in efficiently and effectively transferring learned representations across different modalities and tasks. This paper presents NeRF-Supervised Masked Auto Encoder (NS-MAE), a self-supervised pre-training paradigm for transferable multi-modal representation learning. NS-MAE is designed to provide pre-trained model initializations for efficient and high-performance fine-tuning. Our approach uses masked multi-modal reconstruction in neural radiance fields (NeRF), training the model to reconstruct missing or corrupted input data across multiple modalities. Specifically, multi-modal embeddings are extracted from corrupted LiDAR point clouds and images, conditioned on specific view directions and locations. These embeddings are then rendered into projected multi-modal feature maps using neural rendering techniques. The original multi-modal signals serve as reconstruction targets for the rendered feature maps, facilitating self-supervised representation learning. Extensive experiments demonstrate the promising transferability of NS-MAE representations across diverse multi-modal and single-modal perception models. This transferability is evaluated on various 3D perception downstream tasks, such as 3D object detection and BEV map segmentation, using different amounts of fine-tuning labeled data. Our code will be released to support the community.

5/29/2024

cs.CV cs.AI cs.RO

OmniSat: Self-Supervised Modality Fusion for Earth Observation

Guillaume Astruc, Nicolas Gonthier, Clement Mallet, Loic Landrieu

The field of Earth Observations (EO) offers a wealth of data from diverse sensors, presenting a great opportunity for advancing self-supervised multimodal learning. However, current multimodal EO datasets and models focus on a single data type, either mono-date images or time series, which limits their expressivity. We introduce OmniSat, a novel architecture that exploits the spatial alignment between multiple EO modalities to learn expressive multimodal representations without labels. To demonstrate the advantages of combining modalities of different natures, we augment two existing datasets with new modalities. As demonstrated on three downstream tasks (forestry, land cover classification, and crop mapping), OmniSat can learn rich representations in an unsupervised manner, leading to improved performance in the semi- and fully-supervised settings, even when only one modality is available for inference. The code and dataset are available at github.com/gastruc/OmniSat.

4/15/2024

cs.CV