CromSS: Cross-modal pre-training with noisy labels for remote sensing image segmentation

Read original: arXiv:2405.01217 - Published 5/3/2024 by Chenying Liu, Conrad Albrecht, Yi Wang, Xiao Xiang Zhu

Overview

The paper explores the potential of using noisy labels, which may contain errors, to pretrain semantic segmentation models for geospatial applications.
The researchers propose a novel method called Cross-modal Sample Selection (CromSS) that leverages the class distributions modeled by multiple sensors/modalities to determine the reliability of the noisy labels.
The method is evaluated using Sentinel-1 (radar) and Sentinel-2 (optical) satellite imagery from the SSL4EO-S12 dataset, paired with noisy labels from the Google Dynamic World project.
Transfer learning evaluations on the DFC2020 dataset confirm the effectiveness of the proposed approach for remote sensing image segmentation.

Plain English Explanation

In this research, the team explores a way to use noisy labels - labels that may contain errors - to help train semantic segmentation models for geospatial applications. Semantic segmentation assigns a class, such as road, building, or vegetation, to every pixel of an image, which divides the image into meaningful regions.

The researchers developed a new method called Cross-modal Sample Selection (CromSS). This method looks at the class distributions - the probability that each pixel belongs to each class - as modeled from multiple sensors or modalities, like radar and optical satellite imagery. It uses the consistency of these class distributions across sensors to determine which noisy labels are more reliable.

To test their approach, the team used Sentinel-1 (radar) and Sentinel-2 (optical) satellite images from the SSL4EO-S12 dataset. They paired these images with noisy labels from the Google Dynamic World project, which may not be perfectly accurate.

When they transferred the pretrained models to a different dataset, DFC2020, the evaluations confirmed that the method is effective. This suggests that using noisy labels in a multi-modal learning framework can be a useful technique for remote sensing image analysis.

Technical Explanation

The paper proposes a novel Cross-modal Sample Selection (CromSS) method that leverages the class distributions P^{(d)}(x,c) over pixels x and classes c modelled by multiple sensors/modalities d of a given geospatial scene. The consistency of predictions across sensors d is jointly informed by the entropy of P^{(d)}(x,c).
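To make the role of entropy concrete, here is a minimal sketch in plain Python. It computes the per-pixel entropy of a class distribution for two modalities and compares their predictions. The probability values, the normalization by log(number of classes), and the argmax agreement check are all assumptions made for illustration; this summary does not give the paper's exact formula for combining entropy and consistency.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def normalized_entropy(probs):
    """Entropy scaled to [0, 1] by dividing by log(number of classes)."""
    return entropy(probs) / math.log(len(probs))

# Hypothetical softmax outputs for 3 pixels and 3 classes, one list per modality.
p_s1 = [[0.90, 0.05, 0.05], [0.40, 0.30, 0.30], [1/3, 1/3, 1/3]]  # radar model
p_s2 = [[0.80, 0.10, 0.10], [0.10, 0.80, 0.10], [1/3, 1/3, 1/3]]  # optical model

for x, (a, b) in enumerate(zip(p_s1, p_s2)):
    h1, h2 = normalized_entropy(a), normalized_entropy(b)
    # Illustrative consistency check: do both modalities pick the same class?
    agree = a.index(max(a)) == b.index(max(b))
    print(f"pixel {x}: H_s1={h1:.2f} H_s2={h2:.2f} same_argmax={agree}")
```

A low entropy means the model for that modality is confident about the pixel, while an entropy near 1 (after normalization) means its prediction is close to uniform and carries little information.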

The sampling of noisy labels is determined by the confidence of each sensor d in the noisy class label y(x) of a pixel, P^{(d)}(x,c=y(x)). This allows the researchers to identify the more reliable noisy labels and use them to pretrain the semantic segmentation models.
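The selection step can be sketched as follows, again in plain Python. The function names, the averaging of the two modalities' confidences, and the fixed `keep_fraction` are hypothetical choices for this example, not the paper's actual selection rule.

```python
def label_confidence(probs, noisy_labels):
    """Return P(x, c=y(x)) per pixel: the probability given to the noisy label."""
    return [p[y] for p, y in zip(probs, noisy_labels)]

def select_pixels(conf_a, conf_b, keep_fraction):
    """Keep the pixels with the highest combined confidence.

    Averaging the two modalities is an assumption made for this sketch.
    """
    combined = [(a + b) / 2 for a, b in zip(conf_a, conf_b)]
    n_keep = max(1, round(keep_fraction * len(combined)))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:n_keep])

# Hypothetical class probabilities for 4 pixels and 3 classes per modality.
p_s1 = [[0.90, 0.05, 0.05], [0.40, 0.30, 0.30], [0.20, 0.20, 0.60], [0.10, 0.70, 0.20]]
p_s2 = [[0.80, 0.10, 0.10], [0.10, 0.80, 0.10], [0.10, 0.30, 0.60], [0.20, 0.60, 0.20]]
y = [0, 0, 2, 2]  # noisy label y(x) for each pixel

conf_s1 = label_confidence(p_s1, y)
conf_s2 = label_confidence(p_s2, y)
selected = select_pixels(conf_s1, conf_s2, keep_fraction=0.5)
print(selected)  # indices of pixels whose noisy labels are kept
```

In this toy input, pixels 0 and 2 are kept because both modalities assign a high probability to their noisy labels, while pixels 1 and 3 are dropped because the models disagree with the labels. Only the kept pixels would then contribute to the pretraining loss.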

To evaluate their approach, the team conducts experiments using Sentinel-1 (radar) and Sentinel-2 (optical) satellite imagery from the SSL4EO-S12 dataset. They pair this data with 9-class noisy labels sourced from the Google Dynamic World project.

The transfer learning evaluations on the DFC2020 dataset confirm the effectiveness of the proposed CromSS method for remote sensing image segmentation.

Critical Analysis

The paper presents a novel approach to leveraging noisy labels in a multi-modal learning framework for semantic segmentation of geospatial data. The proposed CromSS method provides a principled way to identify the more reliable noisy labels and use them to pretrain the segmentation models.

One potential limitation of the study is the reliance on the Google Dynamic World dataset for the noisy labels. While this dataset provides broad geospatial coverage, the quality and accuracy of the labels may vary across different regions and land cover types. It would be interesting to see how the CromSS method performs with other sources of noisy labels or with a more comprehensive evaluation of label quality.

Additionally, the paper does not provide a detailed analysis of the failure cases or the types of errors in the noisy labels that the CromSS method is able to overcome. A more in-depth investigation into the performance on specific land cover classes could yield additional insights into the strengths and limitations of the proposed approach.

Overall, the research presents a promising direction for leveraging noisy labels in multi-modal learning for geospatial applications. The CromSS method could be a valuable tool for researchers and practitioners working on semantic segmentation tasks in the remote sensing domain.

Conclusion

This paper explores a novel approach to using noisy labels for pretraining semantic segmentation models in a multi-modal learning framework for geospatial applications. The proposed CromSS method leverages the class distributions and prediction consistency across multiple sensors to identify the more reliable noisy labels and use them effectively for model pretraining.

The team's experiments with Sentinel-1 and Sentinel-2 satellite imagery, paired with noisy labels from the Google Dynamic World project, demonstrate the effectiveness of the CromSS method for remote sensing image segmentation. This research could have important implications for the broader field of geospatial data analysis, where noisy or incomplete labels are a common challenge.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CromSS: Cross-modal pre-training with noisy labels for remote sensing image segmentation

Chenying Liu, Conrad Albrecht, Yi Wang, Xiao Xiang Zhu

We study the potential of noisy labels y to pretrain semantic segmentation models in a multi-modal learning framework for geospatial applications. Specifically, we propose a novel Cross-modal Sample Selection method (CromSS) that utilizes the class distributions P^{(d)}(x,c) over pixels x and classes c modelled by multiple sensors/modalities d of a given geospatial scene. Consistency of predictions across sensors d is jointly informed by the entropy of P^{(d)}(x,c). Noisy label sampling we determine by the confidence of each sensor d in the noisy class label, P^{(d)}(x,c=y(x)). To verify the performance of our approach, we conduct experiments with Sentinel-1 (radar) and Sentinel-2 (optical) satellite imagery from the globally-sampled SSL4EO-S12 dataset. We pair those scenes with 9-class noisy labels sourced from the Google Dynamic World project for pretraining. Transfer learning evaluations (downstream task) on the DFC2020 dataset confirm the effectiveness of the proposed method for remote sensing image segmentation.

5/3/2024

Task Specific Pretraining with Noisy Labels for Remote Sensing Image Segmentation

Chenying Liu, Conrad M Albrecht, Yi Wang, Xiao Xiang Zhu

Compared to supervised deep learning, self-supervision provides remote sensing a tool to reduce the amount of exact, human-crafted geospatial annotations. While image-level information for unsupervised pretraining efficiently works for various classification downstream tasks, the performance on pixel-level semantic segmentation lags behind in terms of model accuracy. On the contrary, many easily available label sources (e.g., automatic labeling tools and land cover land use products) exist, which can provide a large amount of noisy labels for segmentation model training. In this work, we propose to exploit noisy semantic segmentation maps for model pretraining. Our experiments provide insights on robustness per network layer. The transfer learning settings test the cases when the pretrained encoders are fine-tuned for different label classes and decoders. The results from two datasets indicate the effectiveness of task-specific supervised pretraining with noisy labels. Our findings pave new avenues to improved model accuracy and novel pretraining strategies for efficient remote sensing image segmentation.

6/11/2024

Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval

Lifeng Zhou, Yuke Li, Rui Deng, Yuting Yang, Haoqi Zhu

The success of speech-image retrieval relies on establishing an effective alignment between speech and image. Existing methods often model cross-modal interaction through simple cosine similarity of the global feature of each modality, which fall short in capturing fine-grained details within modalities. To address this issue, we introduce an effective framework and a novel learning task named cross-modal denoising (CMD) to enhance cross-modal interaction to achieve finer-level cross-modal alignment. Specifically, CMD is a denoising task designed to reconstruct semantic features from noisy features within one modality by interacting features from another modality. Notably, CMD operates exclusively during model training and can be removed during inference without adding extra inference time. The experimental results demonstrate that our framework outperforms the state-of-the-art method by 2.0% in mean R@1 on the Flickr8k dataset and by 1.7% in mean R@1 on the SpokenCOCO dataset for the speech-image retrieval tasks, respectively. These experimental results validate the efficiency and effectiveness of our framework.

9/12/2024

Cross-sensor self-supervised training and alignment for remote sensing

Valerio Marsocci (CEDRIC - VERTIGO, CNAM), Nicolas Audebert (CEDRIC - VERTIGO, CNAM, LaSTIG, IGN)

Large-scale foundation models have gained traction as a way to leverage the vast amounts of unlabeled remote sensing data collected every day. However, due to the multiplicity of Earth Observation satellites, these models should learn sensor-agnostic representations that generalize across sensor characteristics with minimal fine-tuning. This is complicated by data availability, as low-resolution imagery, such as Sentinel-2 and Landsat-8 data, is available in large amounts, while very high-resolution aerial or satellite data is less common. To tackle these challenges, we introduce cross-sensor self-supervised training and alignment for remote sensing (X-STARS). We design a self-supervised training loss, the Multi-Sensor Alignment Dense loss (MSAD), to align representations across sensors, even with vastly different resolutions. Our X-STARS can be applied to train models from scratch, or to adapt large models pretrained on e.g. low-resolution EO data to new high-resolution sensors, in a continual pretraining framework. We collect and release MSC-France, a new multi-sensor dataset, on which we train our X-STARS models, which we then evaluate on seven downstream classification and segmentation tasks. We demonstrate that X-STARS outperforms the state-of-the-art by a significant margin with less data across various conditions of data availability and resolutions.

5/17/2024