Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation

2406.10519

Published 6/18/2024 by Pengfei Gu, Yejia Zhang, Huimin Li, Hongxiao Wang, Yizhe Zhang, Chaoli Wang, Danny Z. Chen

Abstract

Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel information in visible patches, a ViT encoder can aggregate contextual information for downstream tasks. But, existing MAE pre-training methods, which were specifically developed with the ViT architecture, lack the ability to capture geometric shape and spatial information, which is critical for medical image segmentation tasks. In this paper, we propose a novel extension of known MAEs for self pre-training (i.e., models pre-trained on the same target dataset) for 3D medical image segmentation. (1) We propose a new topological loss to preserve geometric shape information by computing topological signatures of both the input and reconstructed volumes, learning geometric shape information. (2) We introduce a pre-text task that predicts the positions of the centers and eight corners of 3D crops, enabling the MAE to aggregate spatial information. (3) We extend the MAE pre-training strategy to a hybrid state-of-the-art (SOTA) medical image segmentation architecture and co-pretrain it alongside the ViT. (4) We develop a fine-tuned model for downstream segmentation tasks by complementing the pre-trained ViT encoder with our pre-trained SOTA model. Extensive experiments on five public 3D segmentation datasets show the effectiveness of our new approach.

Overview

• This paper introduces a method for 3D medical image segmentation that pre-trains masked autoencoders (MAEs) with added awareness of topology (geometric shape) and spatial position.

• The approach is "self pre-training": the model is pre-trained without labels on the same target dataset that is later used for segmentation, and is then fine-tuned for that task.

• The key innovations are a topological loss, a pre-text task that predicts the positions of the center and eight corners of 3D crops, and the extension of MAE pre-training to a hybrid segmentation architecture that is co-pretrained alongside the ViT.

Plain English Explanation

The paper describes a new way to train artificial intelligence (AI) models to segment, or outline, different structures in 3D medical images like CT scans or MRIs. The key idea is to first have the AI model learn general features from the images themselves, without labels, through a process called self-supervised pre-training. In this paper, the pre-training uses the same dataset that the model is later trained to segment.

During this pre-training, the model tries to predict parts of the 3D images that have been randomly hidden or "masked" out. But unlike typical masked autoencoders, this model is designed to also capture the 3D shape and spatial relationships of the medical structures. The authors hypothesize that this will allow the model to learn more meaningful representations that can then be fine-tuned for the specific task of 3D medical image segmentation.
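The random hiding of image regions can be sketched as follows. This is a minimal illustration, not the authors' code: the crop size (96 voxels per side), patch size (16 voxels per side), 75% mask ratio, and the function name `random_patch_mask` are all assumptions chosen for the example.

```python
import random


def random_patch_mask(volume_shape, patch_size, mask_ratio, seed=0):
    """Split a 3D volume into a patch grid and pick which patches to hide.

    Returns (visible, masked): two sorted lists of (z, y, x) patch indices.
    Assumes each volume dimension is divisible by the patch size.
    """
    grid = tuple(s // p for s, p in zip(volume_shape, patch_size))
    all_patches = [(z, y, x)
                   for z in range(grid[0])
                   for y in range(grid[1])
                   for x in range(grid[2])]
    rng = random.Random(seed)  # seeded so the example is reproducible
    rng.shuffle(all_patches)
    n_masked = int(round(mask_ratio * len(all_patches)))
    masked = sorted(all_patches[:n_masked])
    visible = sorted(all_patches[n_masked:])
    return visible, masked


# Hypothetical settings: 6 x 6 x 6 = 216 patches, of which 75% are hidden.
visible, masked = random_patch_mask((96, 96, 96), (16, 16, 16), mask_ratio=0.75)
print(len(visible), len(masked))  # 54 162
```

In an MAE, only the visible patches would be passed to the encoder, and the decoder would be asked to reconstruct the voxels of the masked ones.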

The benefit of this approach is that pre-training needs no manual labels and no separate external dataset: the model first learns from the raw images of the target dataset and is then fine-tuned on that dataset's labeled examples for segmentation.

Technical Explanation

The core of the proposed method is an extension of known masked autoencoders that adds topology and spatial awareness. According to the abstract, this is achieved through four contributions:

Topological Loss: A new loss computes topological signatures of both the input and the reconstructed volumes, so that the reconstruction preserves geometric shape information.
Position Prediction Pre-text Task: The model predicts the positions of the center and eight corners of 3D crops, which lets the MAE aggregate spatial information.
Hybrid Architecture Co-pretraining: The MAE pre-training strategy is extended to a hybrid state-of-the-art (SOTA) medical image segmentation architecture, which is co-pretrained alongside the ViT.
Complementary Fine-tuned Model: For downstream segmentation, the pre-trained ViT encoder is complemented with the pre-trained SOTA model.
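To give a feel for what comparing "topological signatures" of an input and a reconstructed volume can mean, here is a deliberately simplified sketch. It is a stand-in, not the paper's loss: the abstract does not specify how the signatures are computed, so this example counts connected components (Betti-0) at a few intensity thresholds and compares the counts. All function names and the toy volumes are hypothetical, and unlike a trainable loss this quantity is not differentiable.

```python
from collections import deque

# 6-connectivity: voxels are neighbors if they share a face.
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def count_components(volume, threshold):
    """Count 6-connected components of voxels with intensity >= threshold."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    seen = set()
    components = 0
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if volume[z][y][x] < threshold or (z, y, x) in seen:
                    continue
                components += 1  # found an unvisited foreground voxel
                seen.add((z, y, x))
                queue = deque([(z, y, x)])
                while queue:  # breadth-first flood fill of this component
                    cz, cy, cx = queue.popleft()
                    for dz, dy, dx in NEIGHBORS:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        if (0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                                and (nz, ny, nx) not in seen
                                and volume[nz][ny][nx] >= threshold):
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
    return components


def betti0_curve(volume, thresholds):
    """Toy signature: number of components at each threshold."""
    return [count_components(volume, t) for t in thresholds]


def signature_distance(vol_a, vol_b, thresholds):
    """Sum of absolute differences between the two component-count curves."""
    curve_a = betti0_curve(vol_a, thresholds)
    curve_b = betti0_curve(vol_b, thresholds)
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b))


# Toy 1 x 1 x 5 volumes: the reconstruction wrongly bridges two blobs.
original = [[[1.0, 0.0, 1.0, 0.0, 1.0]]]
reconstruction = [[[1.0, 0.6, 1.0, 0.0, 1.0]]]
print(signature_distance(original, reconstruction, [0.5, 0.8]))  # 1
```

The point of the toy case is that a reconstruction can be close voxel by voxel while still merging structures that should be separate; a shape-aware term penalizes that kind of error.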

During pre-training, the model is trained to reconstruct the masked voxels of the input 3D medical images, with the topological loss and the position-prediction task providing additional training signals. This self-supervised process lets the model learn useful representations without any manual labels.
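The abstract also describes a pre-text task that predicts the positions of the center and eight corners of 3D crops. The sketch below shows one way the regression targets for such a task could be built. The normalization to the [0, 1] range, the mean-squared-error loss, and the function names are assumptions made for illustration; the source does not state these details.

```python
from itertools import product


def crop_position_targets(crop_start, crop_size, volume_shape):
    """Return 9 normalized (z, y, x) points: the crop center, then 8 corners.

    Coordinates are divided by the full volume shape (an assumed
    normalization), so every value lies in [0, 1].
    """
    crop_end = [s + c for s, c in zip(crop_start, crop_size)]
    center = [(s + e) / 2 for s, e in zip(crop_start, crop_end)]
    # Each corner picks either the start or the end coordinate on each axis.
    corners = [list(c) for c in product(*zip(crop_start, crop_end))]
    points = [center] + corners
    return [[coord / dim for coord, dim in zip(point, volume_shape)]
            for point in points]


def position_loss(predicted, target):
    """Mean squared error over all coordinates (assumed loss choice)."""
    squared = [(p - t) ** 2
               for pred_point, target_point in zip(predicted, target)
               for p, t in zip(pred_point, target_point)]
    return sum(squared) / len(squared)


# Hypothetical crop of 64 voxels per side taken from a 128-voxel cube.
targets = crop_position_targets((0, 32, 64), (64, 64, 64), (128, 128, 128))
print(len(targets))  # 9
print(targets[0])    # [0.25, 0.5, 0.75]
```

A network head would output 27 numbers per crop (9 points with 3 coordinates each) and be trained to match these targets, which forces the encoder to encode where in the volume a crop came from.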

For downstream segmentation, the authors build a fine-tuned model that complements the pre-trained ViT encoder with the pre-trained hybrid segmentation model. They report extensive experiments on five public 3D segmentation datasets that show the effectiveness of the approach.
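This summary does not say which metric the experiments use. As an assumption for illustration only, the Dice score is a common way to measure overlap between a predicted and a reference segmentation mask, and a minimal version looks like this:

```python
def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two flattened binary masks (sequences of 0/1).

    Returns a value in [0, 1]; eps avoids division by zero when both
    masks are empty, in which case the score is 1.0.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2 * intersection + eps) / (total + eps)


# One of the two predicted voxels overlaps the single reference voxel.
print(round(dice_score([1, 1, 0, 0], [1, 0, 0, 0]), 4))  # 0.6667
```

A score of 1.0 means the two masks match exactly and a score near 0 means they barely overlap.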

Critical Analysis

The paper makes a case for incorporating topological and spatial awareness into masked autoencoders for 3D medical image analysis. The main innovations, the topological loss and the position-prediction pre-text task, target the geometric shape and spatial information that the abstract argues standard MAE pre-training fails to capture, and the authors report supporting results on five public datasets.

However, one potential limitation is the computational complexity of the proposed approach, which may make it challenging to apply to very large 3D medical volumes. The authors do not provide much discussion of the model's runtime or memory requirements.

Additionally, the paper does not explore the model's robustness to different types of 3D medical data or its generalization to other 3D computer vision tasks beyond segmentation. Further research could investigate the transferability of the learned representations to a broader range of 3D domains.

Overall, this work represents an interesting and promising step towards more effective self-supervised learning for 3D medical image analysis, with the potential to reduce the need for costly manual labeling in this important application area.

Conclusion

The proposed self pre-training approach with topology- and spatiality-aware masked autoencoders is reported to be effective on five public 3D medical image segmentation datasets. Because the model is pre-trained on the target dataset itself, the method needs no additional labeled or external data for pre-training before it is fine-tuned for segmentation.

The key innovations, the topological loss and the position-prediction pre-text task, are designed to help the model capture geometric shape and spatial information in 3D medical images. This represents a step towards more effective self-supervised learning for 3D medical image segmentation, and the ideas may also apply to other 3D vision tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining

Siming Yan, Yuqi Yang, Yuxiao Guo, Hao Pan, Peng-shuai Wang, Xin Tong, Yang Liu, Qixing Huang

Masked autoencoders (MAE) have recently been introduced to 3D self-supervised pretraining for point clouds due to their great success in NLP and computer vision. Unlike MAEs used in the image domain, where the pretext task is to restore features at the masked pixels, such as colors, the existing 3D MAE works reconstruct the missing geometry only, i.e., the location of the masked points. In contrast to previous studies, we advocate that point location recovery is inessential and restoring intrinsic point features is much superior. To this end, we propose to ignore point position reconstruction and recover high-order features at masked points including surface normals and surface variations, through a novel attention-based decoder which is independent of the encoder design. We validate the effectiveness of our pretext task and decoder design using different encoder structures for 3D training and demonstrate the advantages of our pretrained networks on various point cloud analysis tasks.

4/30/2024

cs.CV

Spatio-Temporal Encoding of Brain Dynamics with Surface Masked Autoencoders

Simon Dahan, Logan Z. J. Williams, Yourong Guo, Daniel Rueckert, Emma C. Robinson

The development of robust and generalisable models for encoding the spatio-temporal dynamics of human brain activity is crucial for advancing neuroscientific discoveries. However, significant individual variation in the organisation of the human cerebral cortex makes it difficult to identify population-level trends in these signals. Recently, Surface Vision Transformers (SiTs) have emerged as a promising approach for modelling cortical signals, yet they face some limitations in low-data scenarios due to the lack of inductive biases in their architecture. To address these challenges, this paper proposes the surface Masked AutoEncoder (sMAE) and video surface Masked AutoEncoder (vsMAE) - for multivariate and spatio-temporal pre-training of cortical signals over regular icosahedral grids. These models are trained to reconstruct cortical feature maps from masked versions of the input by learning strong latent representations of cortical structure and function. Such representations translate into better modelling of individual phenotypes and enhanced performance in downstream tasks. The proposed approach was evaluated on cortical phenotype regression using data from the young adult Human Connectome Project (HCP) and developing HCP (dHCP). Results show that (v)sMAE pre-trained models improve phenotyping prediction performance on multiple tasks by $\ge 26\%$, and offer faster convergence relative to models trained from scratch. Finally, we show that pre-training vision transformers on large datasets, such as the UK Biobank (UKB), supports transfer learning to low-data regimes. Our code and pre-trained models are publicly available at https://github.com/metrics-lab/surface-masked-autoencoders .

6/12/2024

eess.IV cs.CV

Spatial-Temporal-Decoupled Masked Pre-training for Spatiotemporal Forecasting

Haotian Gao, Renhe Jiang, Zheng Dong, Jinliang Deng, Yuxin Ma, Xuan Song

Spatiotemporal forecasting techniques are significant for various domains such as transportation, energy, and weather. Accurate prediction of spatiotemporal series remains challenging due to the complex spatiotemporal heterogeneity. In particular, current end-to-end models are limited by input length and thus often fall into spatiotemporal mirage, i.e., similar input time series followed by dissimilar future values and vice versa. To address these problems, we propose a novel self-supervised pre-training framework Spatial-Temporal-Decoupled Masked Pre-training (STD-MAE) that employs two decoupled masked autoencoders to reconstruct spatiotemporal series along the spatial and temporal dimensions. Rich-context representations learned through such reconstruction could be seamlessly integrated by downstream predictors with arbitrary architectures to augment their performances. A series of quantitative and qualitative evaluations on six widely used benchmarks (PEMS03, PEMS04, PEMS07, PEMS08, METR-LA, and PEMS-BAY) are conducted to validate the state-of-the-art performance of STD-MAE. Codes are available at https://github.com/Jimmy-7664/STD-MAE.

4/30/2024

cs.LG

A$^{2}$-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder

Lixian Zhang, Yi Zhao, Runmin Dong, Jinxiao Zhang, Shuai Yuan, Shilei Cao, Mengxuan Chen, Juepeng Zheng, Weijia Li, Wei Liu, Wayne Zhang, Litong Feng, Haohuan Fu

Vast amounts of remote sensing (RS) data provide Earth observations across multiple dimensions, encompassing critical spatial, temporal, and spectral information which is essential for addressing global-scale challenges such as land use monitoring, disaster prevention, and environmental change mitigation. Despite various pre-training methods tailored to the characteristics of RS data, a key limitation persists: the inability to effectively integrate spatial, temporal, and spectral information within a single unified model. To unlock the potential of RS data, we construct a Spatial-Temporal-Spectral Structured Dataset (STSSD) characterized by the incorporation of multiple RS sources, diverse coverage, unified locations within image sets, and heterogeneity within images. Building upon this structured dataset, we propose an Anchor-Aware Masked AutoEncoder method (A$^{2}$-MAE), leveraging intrinsic complementary information from the different kinds of images and geo-information to reconstruct the masked patches during the pre-training phase. A$^{2}$-MAE integrates an anchor-aware masking strategy and a geographic encoding module to comprehensively exploit the properties of RS images. Specifically, the proposed anchor-aware masking strategy dynamically adapts the masking process based on the meta-information of a pre-selected anchor image, thereby facilitating the training on images captured by diverse types of RS sources within one model. Furthermore, we propose a geographic encoding method to leverage accurate spatial patterns, enhancing the model generalization capabilities for downstream applications that are generally location-related. Extensive experiments demonstrate our method achieves comprehensive improvements across various downstream tasks compared with existing RS pre-training methods, including image classification, semantic segmentation, and change detection tasks.

6/18/2024

cs.CV