EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging

arXiv: 2405.11338

Published 5/24/2024 by Danli Shi, Weiyi Zhang, Xiaolan Chen, Yexin Liu, Jiancheng Yang, Siyu Huang, Yih Chung Tham, Yingfeng Zheng, Mingguang He

cs.CV cs.AI

Abstract

Artificial intelligence (AI) is vital in ophthalmology, tackling tasks like diagnosis, classification, and visual question answering (VQA). However, existing AI models in this domain often require extensive annotation and are task-specific, limiting their clinical utility. While recent developments have brought about foundation models for ophthalmology, they are limited by the need to train separate weights for each imaging modality, preventing a comprehensive representation of multi-modal features. This highlights the need for versatile foundation models capable of handling various tasks and modalities in ophthalmology. To address this gap, we present EyeFound, a multimodal foundation model for ophthalmic images. Unlike existing models, EyeFound learns generalizable representations from unlabeled multimodal retinal images, enabling efficient model adaptation across multiple applications. Trained on 2.78 million images from 227 hospitals across 11 ophthalmic modalities, EyeFound facilitates generalist representations and diverse multimodal downstream tasks, even for detecting challenging rare diseases. It outperforms previous work RETFound in diagnosing eye diseases, predicting systemic disease incidents, and zero-shot multimodal VQA. EyeFound provides a generalizable solution to improve model performance and lessen the annotation burden on experts, facilitating widespread clinical AI applications for retinal imaging.

Overview

• Artificial intelligence (AI) is crucial in ophthalmology, tackling tasks like diagnosis, classification, and visual question answering (VQA).

• Existing AI models in this domain often require extensive annotation and are task-specific, limiting their clinical utility.

• Recent developments have brought about foundation models for ophthalmology, but they are limited by the need to train separate weights for each imaging modality, preventing a comprehensive representation of multi-modal features.

• The paper presents EyeFound, a multimodal foundation model for ophthalmic images that addresses these limitations.

Plain English Explanation

• AI is essential in eye care, helping with tasks like identifying eye conditions, sorting images into categories, and answering questions about what an image shows.

• Current AI models used in eye care often need a lot of labeled data and can only do one specific task, which limits how useful they are in the real world.

• Recent work has produced "foundation models" for eye care, but these still need a separate set of trained weights for each type of eye image, so they cannot learn how the different image types relate to one another.

• The EyeFound model presented in this paper addresses these issues by learning general representations from many different types of unlabeled eye images. This allows the model to be easily adapted to handle various eye care tasks and work with different types of eye images.

Technical Explanation

• EyeFound is trained on 2.78 million images from 227 hospitals across 11 ophthalmic modalities, enabling it to learn generalizable representations for diverse multimodal downstream tasks.

• Unlike earlier models such as RETFound, EyeFound does not require training separate weights for each imaging modality, allowing it to better capture the relationships between different types of eye images.

• EyeFound outperforms RETFound in diagnosing eye diseases, predicting systemic disease incidents, and zero-shot multimodal VQA, demonstrating its ability to handle a wide range of ophthalmic applications.
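
To make the contrast between separate per-modality weights and a single shared set of weights concrete, here is a minimal sketch. It is not EyeFound's or RETFound's architecture: the tiny linear "encoder", the three modality names, and the dimensions are all assumptions chosen for illustration, since the summary does not describe the model internals.

```python
import random

# Hypothetical modality names; the paper covers 11 modalities,
# which are not listed in this summary.
MODALITIES = ["color_fundus", "oct", "fluorescein_angiography"]

IN_DIM, OUT_DIM = 16, 4  # assumed toy sizes, not taken from the paper


def init_weights(rng, out_dim, in_dim):
    """Create a small random weight matrix (one row per output dimension)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def linear_encoder(weights, pixels):
    """Project a flat list of pixel values to an embedding vector."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def count_params(weights):
    """Count the scalar parameters in a weight matrix."""
    return sum(len(row) for row in weights)


rng = random.Random(0)

# Per-modality design: one independent weight set for each modality.
per_modality = {m: init_weights(rng, OUT_DIM, IN_DIM) for m in MODALITIES}

# Shared design: a single weight set that sees every modality.
shared = init_weights(rng, OUT_DIM, IN_DIM)


def encode_per_modality(image):
    """Route the image to the weights trained for its own modality."""
    return linear_encoder(per_modality[image["modality"]], image["pixels"])


def encode_shared(image):
    """Encode any image with the same weights, regardless of modality."""
    return linear_encoder(shared, image["pixels"])


# A mixed batch containing one synthetic image per modality.
batch = [
    {"modality": m, "pixels": [rng.random() for _ in range(IN_DIM)]}
    for m in MODALITIES
]

shared_embeddings = [encode_shared(img) for img in batch]
separate_embeddings = [encode_per_modality(img) for img in batch]

shared_param_count = count_params(shared)
per_modality_param_count = sum(count_params(w) for w in per_modality.values())

print("shared parameters:", shared_param_count)
print("per-modality parameters:", per_modality_param_count)
```

In the shared design every image passes through the same parameters, so all modalities land in one embedding space and the parameter count does not grow with the number of modalities. In the per-modality design the count scales linearly with the number of modalities and nothing ties the embedding spaces together.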

Critical Analysis

• The paper acknowledges that while EyeFound provides a generalizable solution, it may still be limited by the quality and diversity of the training data used.

• The authors also note that further research is needed to explore the model's capabilities in handling rare or under-represented eye conditions, as well as its performance on real-world clinical data.

• Potential issues that could be explored include the model's interpretability, its robustness to distribution shift, and its ability to handle noisy or incomplete input data in a clinical setting.

Conclusion

• EyeFound is a promising multimodal foundation model that can improve the performance of AI systems in ophthalmology and lessen the annotation burden on experts.

• By learning generalizable representations from diverse eye images, EyeFound enables efficient model adaptation across multiple applications, including the detection of challenging rare diseases.

• This research highlights the potential for versatile foundation models to drive widespread clinical AI applications in eye care and other specialized medical domains, such as chest CT imaging.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper to verify details.

Related Papers

Training a high-performance retinal foundation model with half-the-data and 400 times less compute

Justin Engelmann, Miguel O. Bernabeu

Artificial Intelligence holds tremendous potential in medicine, but is traditionally limited by the lack of massive datasets to train models on. Foundation models, pre-trained models that can be adapted to downstream tasks with small datasets, could alleviate this problem. Researchers at Moorfields Eye Hospital (MEH) proposed RETFound-MEH, a foundation model for retinal imaging that was trained on 900,000 images, including private hospital data. Recently, data-efficient DERETFound was proposed that provides comparable performance while being trained on only 150,000 images that are all publicly available. However, both these models required very substantial resources to train initially and are resource-intensive in downstream use. We propose a novel Token Reconstruction objective that we use to train RETFound-Green, a retinal foundation model trained using only 75,000 publicly available images and 400 times less compute. We estimate the cost of training RETFound-MEH and DERETFound at $10,000 and $14,000, respectively, while RETFound-Green could be trained for less than $100, with equally reduced environmental impact. RETFound-Green is also far more efficient in downstream use: it can be downloaded 14 times faster, computes vector embeddings 2.7 times faster which then require 2.6 times less storage space. Despite this, RETFound-Green does not perform systematically worse. In fact, it performs best on 14 tasks, compared to six for DERETFound and two for RETFound-MEH. Our results suggest that RETFound-Green is a very efficient, high-performance retinal foundation model. We anticipate that our Token Reconstruction objective could be scaled up for even higher performance and be applied to other domains beyond retinal imaging.

5/2/2024

cs.CV cs.AI

Confidence-aware multi-modality learning for eye disease screening

Ke Zou, Tian Lin, Zongbo Han, Meng Wang, Xuedong Yuan, Haoyu Chen, Changqing Zhang, Xiaojing Shen, Huazhu Fu

Multi-modal ophthalmic image classification plays a key role in diagnosing eye diseases, as it integrates information from different sources to complement their respective performances. However, recent improvements have mainly focused on accuracy, often neglecting the importance of confidence and robustness in predictions for diverse modalities. In this study, we propose a novel multi-modality evidential fusion pipeline for eye disease screening. It provides a measure of confidence for each modality and elegantly integrates the multi-modality information using a multi-distribution fusion perspective. Specifically, our method first utilizes normal inverse gamma prior distributions over pre-trained models to learn both aleatoric and epistemic uncertainty for uni-modality. Then, the normal inverse gamma distribution is analyzed as the Student's t distribution. Furthermore, within a confidence-aware fusion framework, we propose a mixture of Student's t distributions to effectively integrate different modalities, imparting the model with heavy-tailed properties and enhancing its robustness and reliability. More importantly, the confidence-aware multi-modality ranking regularization term induces the model to more reasonably rank the noisy single-modal and fused-modal confidence, leading to improved reliability and accuracy. Experimental results on both public and internal datasets demonstrate that our model excels in robustness, particularly in challenging scenarios involving Gaussian noise and modality missing conditions. Moreover, our model exhibits strong generalization capabilities to out-of-distribution data, underscoring its potential as a promising solution for multimodal eye disease screening.

5/29/2024

eess.IV cs.CV

One for All: Toward Unified Foundation Models for Earth Vision

Zhitong Xiong, Yi Wang, Fahong Zhang, Xiao Xiang Zhu

Foundation models characterized by extensive parameters and trained on large-scale datasets have demonstrated remarkable efficacy across various downstream tasks for remote sensing data. Current remote sensing foundation models typically specialize in a single modality or a specific spatial resolution range, limiting their versatility for downstream datasets. While there have been attempts to develop multi-modal remote sensing foundation models, they typically employ separate vision encoders for each modality or spatial resolution, necessitating a switch in backbones contingent upon the input data. To address this issue, we introduce a simple yet effective method, termed OFA-Net (One-For-All Network): employing a single, shared Transformer backbone for multiple data modalities with different spatial resolutions. Using the masked image modeling mechanism, we pre-train a single Transformer backbone on a curated multi-modal dataset with this simple design. Then the backbone model can be used in different downstream tasks, thus forging a path towards a unified foundation backbone model in Earth vision. The proposed method is evaluated on 12 distinct downstream tasks and demonstrates promising performance.

5/29/2024

cs.CV

Eye-gaze Guided Multi-modal Alignment Framework for Radiology

Chong Ma, Hanqi Jiang, Wenting Chen, Yiwei Li, Zihao Wu, Xiaowei Yu, Zhengliang Liu, Lei Guo, Dajiang Zhu, Tuo Zhang, Dinggang Shen, Tianming Liu, Xiang Li

In the medical multi-modal frameworks, the alignment of cross-modality features presents a significant challenge. However, existing works have learned features that are implicitly aligned from the data, without considering the explicit relationships in the medical context. This data-reliance may lead to low generalization of the learned alignment relationships. In this work, we propose the Eye-gaze Guided Multi-modal Alignment (EGMA) framework to harness eye-gaze data for better alignment of medical visual and textual features. We explore the natural auxiliary role of radiologists' eye-gaze data in aligning medical images and text, and introduce a novel approach by using eye-gaze data, collected synchronously by radiologists during diagnostic evaluations. We conduct downstream tasks of image classification and image-text retrieval on four medical datasets, where EGMA achieved state-of-the-art performance and stronger generalization across different datasets. Additionally, we explore the impact of varying amounts of eye-gaze data on model performance, highlighting the feasibility and utility of integrating this auxiliary data into multi-modal alignment framework.

6/17/2024

cs.CV cs.CL