Masked Image Modelling for retinal OCT understanding

2405.14788

Published 5/24/2024 by Theodoros Pissas, Pablo M'arquez-Neila, Sebastian Wolf, Martin Zinkernagel, Raphael Sznitman

🖼️

Abstract

This work explores the effectiveness of masked image modelling for learning representations of retinal OCT images. To this end, we leverage Masked Autoencoders (MAE), a simple and scalable method for self-supervised learning, to obtain a powerful and general representation for OCT images by training on 700K OCT images from 41K patients collected under real world clinical settings. We also provide the first extensive evaluation for a model of OCT on a challenging battery of 6 downstream tasks. Our model achieves strong performance when fully finetuned but can also serve as a versatile frozen feature extractor for many tasks using lightweight adapters. Furthermore, we propose an extension of the MAE pretraining to fuse OCT with an auxiliary modality, namely, IR fundus images and learn a joint model for both. We demonstrate our approach improves performance on a multimodal downstream application. Our experiments utilize most publicly available OCT datasets, thus enabling future comparisons. Our code and model weights are publicly available https://github.com/TheoPis/MIM_OCT.

Create account to get full access

Overview

This paper explores the use of Masked Autoencoders (MAE) for learning representations of retinal Optical Coherence Tomography (OCT) images.
The researchers leverage MAE, a self-supervised learning method, to obtain a powerful and general representation for OCT images by training on a large dataset of 700K OCT images from 41K patients.
They provide an extensive evaluation of their model on a challenging battery of 6 downstream tasks and demonstrate strong performance.
The paper also introduces an extension of the MAE pretraining to fuse OCT with an auxiliary modality, namely, IR fundus images, and learn a joint model for both.

Plain English Explanation

In this work, the researchers explored using a technique called Masked Autoencoders (MAE) to learn useful representations from retinal OCT images. OCT is a medical imaging technique that allows doctors to see detailed images of the different layers of the retina.

The researchers trained their MAE model on a large dataset of over 700,000 OCT images from 41,000 patients. This allowed the model to learn general and powerful features from the OCT images, without needing any additional labels or information. The researchers then tested their model on a variety of downstream tasks, such as detecting eye diseases, and found that it performed very well, even when used as a feature extractor for these tasks.

Additionally, the researchers experimented with combining the OCT images with another type of medical image, called IR fundus images. By training the MAE model on both types of images, they were able to learn a joint representation that performed better on a multimodal downstream application than using just the OCT images alone.

Overall, this research shows that Masked Autoencoders can be a powerful tool for learning useful representations from medical imaging data, like OCT images, without needing a lot of labeled data. This could be helpful for developing AI systems that can assist doctors in analyzing and interpreting these types of medical images.

Technical Explanation

The researchers in this paper leverage the Masked Autoencoder (MAE) approach, a simple and scalable method for self-supervised learning, to obtain a powerful and general representation for retinal OCT images.

They train their MAE model on a large dataset of 700K OCT images from 41K patients, collected under real-world clinical settings. This allows the model to learn robust features from the data without relying on costly manual annotations.

The researchers then provide an extensive evaluation of their MAE-based model on a challenging battery of 6 downstream tasks, including disease classification, segmentation, and other clinically relevant applications. They show that their model achieves strong performance when fully fine-tuned, and can also serve as a versatile frozen feature extractor for many tasks using lightweight adapters.

Furthermore, the paper introduces an extension of the MAE pretraining to fuse OCT with an auxiliary modality, namely, IR fundus images. By learning a joint model for both imaging modalities, the researchers demonstrate improved performance on a multimodal downstream application.

The experiments in this work leverage most publicly available OCT datasets, enabling future comparisons and reproducibility of the results.

Critical Analysis

The paper provides a comprehensive evaluation of the MAE approach for learning representations from retinal OCT images, which is a valuable contribution to the field. The use of a large, real-world dataset and the thorough testing on a variety of downstream tasks lend credibility to the findings.

However, the paper does not address potential limitations or caveats of the approach. For example, it would be useful to understand the performance of the MAE model relative to other self-supervised or supervised learning methods for this task, as well as the computational and training time requirements of the approach.

Additionally, while the extension to multimodal learning with IR fundus images is interesting, the paper does not provide a detailed analysis of the benefits and drawbacks of this approach compared to using OCT data alone. Further exploration of the complementarity of the two imaging modalities could strengthen the conclusions.

Overall, this work is a valuable contribution to the literature on self-supervised learning for medical imaging, but additional analysis and comparisons could help readers better understand the strengths and limitations of the proposed approach.

Conclusion

This paper demonstrates the effectiveness of Masked Autoencoders (MAE) for learning powerful representations from retinal OCT images in a self-supervised manner. The researchers show that their MAE-based model achieves strong performance on a variety of downstream tasks, and can also serve as a versatile feature extractor.

Furthermore, the paper introduces an extension of the MAE pretraining to fuse OCT with IR fundus images, resulting in improved performance on a multimodal application. This work highlights the potential of self-supervised learning techniques, like MAE, to enable the development of effective AI systems for medical image analysis, without the need for extensive manual labeling.

Overall, this research contributes to the growing body of evidence supporting the use of self-supervised learning methods in the medical imaging domain, and could have important implications for improving clinical decision-making and patient outcomes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🖼️

Exploring Masked Autoencoders for Sensor-Agnostic Image Retrieval in Remote Sensing

Jakob Hackstein, Gencer Sumbul, Kai Norman Clasen, Begum Demir

Self-supervised learning through masked autoencoders (MAEs) has recently attracted great attention for remote sensing (RS) image representation learning, and thus embodies a significant potential for content-based image retrieval (CBIR) from ever-growing RS image archives. However, the existing studies on MAEs in RS assume that the considered RS images are acquired by a single image sensor, and thus are only suitable for uni-modal CBIR problems. The effectiveness of MAEs for cross-sensor CBIR, which aims to search semantically similar images across different image modalities, has not been explored yet. In this paper, we take the first step to explore the effectiveness of MAEs for sensor-agnostic CBIR in RS. To this end, we present a systematic overview on the possible adaptations of the vanilla MAE to exploit masked image modeling on multi-sensor RS image archives (denoted as cross-sensor masked autoencoders [CSMAEs]). Based on different adjustments applied to the vanilla MAE, we introduce different CSMAE models. We also provide an extensive experimental analysis of these CSMAE models. We finally derive a guideline to exploit masked image modeling for uni-modal and cross-modal CBIR problems in RS. The code of this work is publicly available at https://github.com/jakhac/CSMAE.

4/12/2024

eess.IV cs.CV

Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology

Oren Kraus, Kian Kenyon-Dean, Saber Saberian, Maryam Fallah, Peter McLean, Jess Leung, Vasudev Sharma, Ayla Khan, Jia Balakrishnan, Safiye Celik, Dominique Beaini, Maciej Sypetkowski, Chi Vicky Cheng, Kristen Morse, Maureen Makes, Ben Mabey, Berton Earnshaw

Featurizing microscopy images for use in biological research remains a significant challenge, especially for large-scale experiments spanning millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when training with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as a 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally, we develop a new channel-agnostic MAE architecture (CA-MAE) that allows for inputting images of different numbers and orders of channels at inference time. We demonstrate that CA-MAEs effectively generalize by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond.

4/17/2024

cs.CV cs.AI cs.LG

Self-supervised Pre-training for Transferable Multi-modal Perception

Xiaohao Xu, Tianyi Zhang, Jinrong Yang, Matthew Johnson-Roberson, Xiaonan Huang

In autonomous driving, multi-modal perception models leveraging inputs from multiple sensors exhibit strong robustness in degraded environments. However, these models face challenges in efficiently and effectively transferring learned representations across different modalities and tasks. This paper presents NeRF-Supervised Masked Auto Encoder (NS-MAE), a self-supervised pre-training paradigm for transferable multi-modal representation learning. NS-MAE is designed to provide pre-trained model initializations for efficient and high-performance fine-tuning. Our approach uses masked multi-modal reconstruction in neural radiance fields (NeRF), training the model to reconstruct missing or corrupted input data across multiple modalities. Specifically, multi-modal embeddings are extracted from corrupted LiDAR point clouds and images, conditioned on specific view directions and locations. These embeddings are then rendered into projected multi-modal feature maps using neural rendering techniques. The original multi-modal signals serve as reconstruction targets for the rendered feature maps, facilitating self-supervised representation learning. Extensive experiments demonstrate the promising transferability of NS-MAE representations across diverse multi-modal and single-modal perception models. This transferability is evaluated on various 3D perception downstream tasks, such as 3D object detection and BEV map segmentation, using different amounts of fine-tuning labeled data. Our code will be released to support the community.

5/29/2024

cs.CV cs.AI cs.RO

Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset

Fengxiang Wang, Hongzhen Wang, Di Wang, Zonghao Guo, Zhenyu Zhong, Long Lan, Jing Zhang, Zhiyuan Liu, Maosong Sun

Masked Image Modeling (MIM) has emerged as a pivotal approach for developing foundational visual models in the field of remote sensing (RS). However, current RS datasets are limited in volume and diversity, which significantly constrains the capacity of MIM methods to learn generalizable representations. In this study, we introduce textbf{RS-4M}, a large-scale dataset designed to enable highly efficient MIM training on RS images. RS-4M comprises 4 million optical images encompassing abundant and fine-grained RS visual tasks, including object-level detection and pixel-level segmentation. Compared to natural images, RS images often contain massive redundant background pixels, which limits the training efficiency of the conventional MIM models. To address this, we propose an efficient MIM method, termed textbf{SelectiveMAE}, which dynamically encodes and reconstructs a subset of patch tokens selected based on their semantic richness. SelectiveMAE roots in a progressive semantic token selection module, which evolves from reconstructing semantically analogical tokens to encoding complementary semantic dependencies. This approach transforms conventional MIM training into a progressive feature learning process, enabling SelectiveMAE to efficiently learn robust representations of RS images. Extensive experiments show that SelectiveMAE significantly boosts training efficiency by 2.2-2.7 times and enhances the classification, detection, and segmentation performance of the baseline MIM model.The dataset, source code, and trained models will be released.

6/19/2024

cs.CV