UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling

Read original: arXiv:2408.05618 - Published 8/13/2024 by Kai Yu, Yang Zhou, Yang Bai, Zhi Da Soh, Xinxing Xu, Rick Siow Mong Goh, Ching-Yu Cheng, Yong Liu

UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling

Overview

UrFound is a novel approach to developing a universal retinal foundation model using knowledge-guided masked modeling.
The key idea is to leverage domain-specific knowledge to guide the pretraining of a multimodal foundation model for retinal image understanding.
This allows the model to learn rich representations that generalize well to various downstream retinal analysis tasks.

Plain English Explanation

The paper presents the UrFound model, which aims to create a universal retinal foundation model that can be used for a wide range of retinal image analysis tasks.

The core innovation is the use of knowledge-guided masked modeling during pretraining. This means the model is trained to predict parts of the input retinal images that have been deliberately masked out, but it is guided by domain-specific knowledge about the structure and features of the retina.

This allows the model to learn rich and generalizable representations of retinal images, which can then be fine-tuned for various downstream applications like disease diagnosis, anatomical segmentation, and more. The key benefit is that the model can be adapted to many different retinal analysis tasks without having to start from scratch each time.

Technical Explanation

The UrFound model is built on top of a multimodal vision-language backbone, which allows it to process both visual (e.g., retinal images) and textual (e.g., medical reports) inputs.

During pretraining, the model uses a knowledge-guided masked modeling approach, where parts of the input retinal images are randomly masked out, and the model is trained to predict these masked regions. Crucially, this prediction process is guided by domain-specific knowledge about the anatomy and characteristics of the retina, which is encoded in the model's architecture and loss function.

The knowledge-guided aspect allows the model to learn meaningful and generalizable representations of retinal structures and features, rather than just memorizing low-level patterns. This facilitates effective transfer learning to a variety of downstream retinal analysis tasks.

The paper demonstrates the effectiveness of the UrFound model through extensive experiments on multiple retinal image datasets and tasks, showing strong performance compared to previous state-of-the-art approaches.

Critical Analysis

The UrFound paper presents a compelling approach to developing a universal retinal foundation model, but there are a few potential limitations and areas for further research:

Scalability and Generalization: While the model shows strong performance on the evaluated tasks, it remains to be seen how well it will scale and generalize to a wider range of retinal analysis problems, especially those involving rare or complex pathologies.
Interpretability: The use of masked modeling and domain-specific knowledge introduces some complexity to the model, which may make it more difficult to interpret the learned representations and understand the model's decision-making process.
Data Efficiency: The paper does not explicitly address the data efficiency of the knowledge-guided pretraining approach. It would be interesting to see how the model performs when faced with limited training data for downstream tasks.
Real-world Deployment: The paper focuses on academic benchmarks, but the deployment of such a model in a clinical setting would likely require additional considerations around interpretability, robustness, and integration with existing workflows.

Overall, the UrFound model represents an exciting step forward in the development of universal retinal foundation models, but further research and validation will be needed to fully realize its potential.

Conclusion

The UrFound paper presents a novel approach to building a universal retinal foundation model using knowledge-guided masked modeling. By leveraging domain-specific knowledge during pretraining, the model is able to learn rich and generalizable representations of retinal images that can be effectively transferred to a variety of downstream tasks.

This work has the potential to significantly streamline and improve the development of retinal image analysis tools, enabling more efficient and accurate diagnosis, monitoring, and treatment of various eye diseases and conditions. As the field of retinal image understanding continues to evolve, the UrFound model represents an important step towards the creation of truly universal and versatile foundation models for this domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling

Kai Yu, Yang Zhou, Yang Bai, Zhi Da Soh, Xinxing Xu, Rick Siow Mong Goh, Ching-Yu Cheng, Yong Liu

Retinal foundation models aim to learn generalizable representations from diverse retinal images, facilitating label-efficient model adaptation across various ophthalmic tasks. Despite their success, current retinal foundation models are generally restricted to a single imaging modality, such as Color Fundus Photography (CFP) or Optical Coherence Tomography (OCT), limiting their versatility. Moreover, these models may struggle to fully leverage expert annotations and overlook the valuable domain knowledge essential for domain-specific representation learning. To overcome these limitations, we introduce UrFound, a retinal foundation model designed to learn universal representations from both multimodal retinal images and domain knowledge. UrFound is equipped with a modality-agnostic image encoder and accepts either CFP or OCT images as inputs. To integrate domain knowledge into representation learning, we encode expert annotation in text supervision and propose a knowledge-guided masked modeling strategy for model pre-training. It involves reconstructing randomly masked patches of retinal images while predicting masked text tokens conditioned on the corresponding retinal image. This approach aligns multimodal images and textual expert annotations within a unified latent space, facilitating generalizable and domain-specific representation learning. Experimental results demonstrate that UrFound exhibits strong generalization ability and data efficiency when adapting to various tasks in retinal image analysis. By training on ~180k retinal images, UrFound significantly outperforms the state-of-the-art retinal foundation model trained on up to 1.6 million unlabelled images across 8 public retinal datasets. Our code and data are available at https://github.com/yukkai/UrFound.

8/13/2024

📈

EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging

Danli Shi, Weiyi Zhang, Xiaolan Chen, Yexin Liu, Jiancheng Yang, Siyu Huang, Yih Chung Tham, Yingfeng Zheng, Mingguang He

Artificial intelligence (AI) is vital in ophthalmology, tackling tasks like diagnosis, classification, and visual question answering (VQA). However, existing AI models in this domain often require extensive annotation and are task-specific, limiting their clinical utility. While recent developments have brought about foundation models for ophthalmology, they are limited by the need to train separate weights for each imaging modality, preventing a comprehensive representation of multi-modal features. This highlights the need for versatile foundation models capable of handling various tasks and modalities in ophthalmology. To address this gap, we present EyeFound, a multimodal foundation model for ophthalmic images. Unlike existing models, EyeFound learns generalizable representations from unlabeled multimodal retinal images, enabling efficient model adaptation across multiple applications. Trained on 2.78 million images from 227 hospitals across 11 ophthalmic modalities, EyeFound facilitates generalist representations and diverse multimodal downstream tasks, even for detecting challenging rare diseases. It outperforms previous work RETFound in diagnosing eye diseases, predicting systemic disease incidents, and zero-shot multimodal VQA. EyeFound provides a generalizable solution to improve model performance and lessen the annotation burden on experts, facilitating widespread clinical AI applications for retinal imaging.

5/24/2024

🏋️

Training a high-performance retinal foundation model with half-the-data and 400 times less compute

Justin Engelmann, Miguel O. Bernabeu

Artificial Intelligence holds tremendous potential in medicine, but is traditionally limited by the lack of massive datasets to train models on. Foundation models, pre-trained models that can be adapted to downstream tasks with small datasets, could alleviate this problem. Researchers at Moorfields Eye Hospital (MEH) proposed RETFound-MEH, a foundation model for retinal imaging that was trained on 900,000 images, including private hospital data. Recently, data-efficient DERETFound was proposed that provides comparable performance while being trained on only 150,000 images that are all publicly available. However, both these models required very substantial resources to train initially and are resource-intensive in downstream use. We propose a novel Token Reconstruction objective that we use to train RETFound-Green, a retinal foundation model trained using only 75,000 publicly available images and 400 times less compute. We estimate the cost of training RETFound-MEH and DERETFound at $10,000 and $14,000, respectively, while RETFound-Green could be trained for less than $100, with equally reduced environmental impact. RETFound-Green is also far more efficient in downstream use: it can be downloaded 14 times faster, computes vector embeddings 2.7 times faster which then require 2.6 times less storage space. Despite this, RETFound-Green does not perform systematically worse. In fact, it performs best on 14 tasks, compared to six for DERETFound and two for RETFound-MEH. Our results suggest that RETFound-Green is a very efficient, high-performance retinal foundation model. We anticipate that our Token Reconstruction objective could be scaled up for even higher performance and be applied to other domains beyond retinal imaging.

5/2/2024

MM-Retinal: Knowledge-Enhanced Foundational Pretraining with Fundus Image-Text Expertise

Ruiqi Wu, Chenran Zhang, Jianle Zhang, Yi Zhou, Tao Zhou, Huazhu Fu

Current fundus image analysis models are predominantly built for specific tasks relying on individual datasets. The learning process is usually based on data-driven paradigm without prior knowledge, resulting in poor transferability and generalizability. To address this issue, we propose MM-Retinal, a multi-modal dataset that encompasses high-quality image-text pairs collected from professional fundus diagram books. Moreover, enabled by MM-Retinal, we present a novel Knowledge-enhanced foundational pretraining model which incorporates Fundus Image-Text expertise, called KeepFIT. It is designed with image similarity-guided text revision and mixed training strategy to infuse expert knowledge. Our proposed fundus foundation model achieves state-of-the-art performance across six unseen downstream tasks and holds excellent generalization ability in zero-shot and few-shot scenarios. MM-Retinal and KeepFIT are available at https://github.com/lxirich/MM-Retinal.

5/21/2024