MM-Retinal: Knowledge-Enhanced Foundational Pretraining with Fundus Image-Text Expertise

Read original: arXiv:2405.11793 - Published 5/21/2024 by Ruiqi Wu, Chenran Zhang, Jianle Zhang, Yi Zhou, Tao Zhou, Huazhu Fu

MM-Retinal: Knowledge-Enhanced Foundational Pretraining with Fundus Image-Text Expertise

Overview

This paper presents a new dataset called "MM-Retinal" which combines fundus images (images of the back of the eye) with corresponding textual descriptions.
The authors use this dataset to train a knowledge-enhanced foundation model, which is a type of large language model that has been pre-trained on a diverse set of data to gain broad understanding.
The goal is to create a model that can effectively process and understand both visual and textual information related to the eye and eye diseases.

Plain English Explanation

The human eye is a complex organ, and understanding eye health and diseases requires expertise in both visual information (like images of the eye) and textual information (like medical reports). The researchers behind this paper have created a new dataset called "MM-Retinal" that combines fundus images - pictures of the back of the eye - with corresponding text descriptions.

By using this dataset to train a "knowledge-enhanced foundation model", the researchers have developed an AI system that can understand both the visual and textual information related to the eye. This could be incredibly useful for applications like diagnosing eye conditions or analyzing medical images.

The key idea is that by pre-training this model on a wide variety of data, it can develop a broad, flexible understanding that allows it to effectively process and reason about information related to eye health, even if it hasn't seen that exact information before. This multimodal approach integrating both visual and textual data is an important step towards creating AI systems that can truly understand and assist with complex medical tasks.

Technical Explanation

The authors introduce the "MM-Retinal" dataset, which contains over 100,000 fundus images paired with corresponding textual descriptions. This dataset was compiled from publicly available sources and covers a wide range of eye conditions and pathologies.

Using this dataset, the researchers train a knowledge-enhanced foundation model - a large language model that has been pre-trained on a diverse set of data to gain broad understanding. Specifically, they use a vision-language model architecture that can process both image and text inputs.

The key innovation is the use of a "knowledge injection" technique, where the model is further pre-trained on a set of carefully curated medical texts and images related to ophthalmology. This allows the model to build specialized expertise in understanding and reasoning about eye health on top of its general foundational knowledge.

The authors evaluate the performance of their model on a variety of tasks, including fundus image classification, retinal lesion segmentation, and multimodal question answering. They demonstrate that their knowledge-enhanced model outperforms both standard vision and language models, as well as previous state-of-the-art approaches.

Critical Analysis

One potential limitation of this research is the reliance on a curated dataset of fundus images and text descriptions. While the MM-Retinal dataset is quite large, it may not capture the full diversity of real-world eye health data that a deployed system would need to handle.

Additionally, the authors do not provide a detailed analysis of the types of knowledge and reasoning that their model has learned through the knowledge injection process. It would be helpful to understand more about the specific insights and capabilities that this approach has enabled.

That said, the overall approach of leveraging foundational knowledge and domain-specific expertise is a promising direction for developing robust and capable AI systems for medical applications. Further research in this area could lead to significant advancements in the field of computer-assisted diagnosis and treatment.

Conclusion

This paper presents a novel dataset and technique for training a knowledge-enhanced foundation model for understanding and reasoning about eye health. By combining fundus images with corresponding textual descriptions, the researchers have created a rich multimodal resource for developing AI systems that can effectively process both visual and textual information related to the eye.

The resulting model demonstrates strong performance on a variety of ophthalmic tasks, highlighting the potential of this approach to drive progress in medical AI. As the field continues to evolve, further innovations in areas like explainable AI and cross-dataset generalization will be crucial for building trustworthy and versatile AI systems that can truly assist healthcare professionals and patients.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

MM-Retinal: Knowledge-Enhanced Foundational Pretraining with Fundus Image-Text Expertise

Ruiqi Wu, Chenran Zhang, Jianle Zhang, Yi Zhou, Tao Zhou, Huazhu Fu

Current fundus image analysis models are predominantly built for specific tasks relying on individual datasets. The learning process is usually based on data-driven paradigm without prior knowledge, resulting in poor transferability and generalizability. To address this issue, we propose MM-Retinal, a multi-modal dataset that encompasses high-quality image-text pairs collected from professional fundus diagram books. Moreover, enabled by MM-Retinal, we present a novel Knowledge-enhanced foundational pretraining model which incorporates Fundus Image-Text expertise, called KeepFIT. It is designed with image similarity-guided text revision and mixed training strategy to infuse expert knowledge. Our proposed fundus foundation model achieves state-of-the-art performance across six unseen downstream tasks and holds excellent generalization ability in zero-shot and few-shot scenarios. MM-Retinal and KeepFIT are available at https://github.com/lxirich/MM-Retinal.

5/21/2024

A Disease-Specific Foundation Model Using Over 100K Fundus Images: Release and Validation for Abnormality and Multi-Disease Classification on Downstream Tasks

Boa Jang, Youngbin Ahn, Eun Kyung Choe, Chang Ki Yoon, Hyuk Jin Choi, Young-Gon Kim

Artificial intelligence applied to retinal images offers significant potential for recognizing signs and symptoms of retinal conditions and expediting the diagnosis of eye diseases and systemic disorders. However, developing generalized artificial intelligence models for medical data often requires a large number of labeled images representing various disease signs, and most models are typically task-specific, focusing on major retinal diseases. In this study, we developed a Fundus-Specific Pretrained Model (Image+Fundus), a supervised artificial intelligence model trained to detect abnormalities in fundus images. A total of 57,803 images were used to develop this pretrained model, which achieved superior performance across various downstream tasks, indicating that our proposed model outperforms other general methods. Our Image+Fundus model offers a generalized approach to improve model performance while reducing the number of labeled datasets required. Additionally, it provides more disease-specific insights into fundus images, with visualizations generated by our model. These disease-specific foundation models are invaluable in enhancing the performance and efficiency of deep learning models in the field of fundus imaging.

8/19/2024

UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling

Kai Yu, Yang Zhou, Yang Bai, Zhi Da Soh, Xinxing Xu, Rick Siow Mong Goh, Ching-Yu Cheng, Yong Liu

Retinal foundation models aim to learn generalizable representations from diverse retinal images, facilitating label-efficient model adaptation across various ophthalmic tasks. Despite their success, current retinal foundation models are generally restricted to a single imaging modality, such as Color Fundus Photography (CFP) or Optical Coherence Tomography (OCT), limiting their versatility. Moreover, these models may struggle to fully leverage expert annotations and overlook the valuable domain knowledge essential for domain-specific representation learning. To overcome these limitations, we introduce UrFound, a retinal foundation model designed to learn universal representations from both multimodal retinal images and domain knowledge. UrFound is equipped with a modality-agnostic image encoder and accepts either CFP or OCT images as inputs. To integrate domain knowledge into representation learning, we encode expert annotation in text supervision and propose a knowledge-guided masked modeling strategy for model pre-training. It involves reconstructing randomly masked patches of retinal images while predicting masked text tokens conditioned on the corresponding retinal image. This approach aligns multimodal images and textual expert annotations within a unified latent space, facilitating generalizable and domain-specific representation learning. Experimental results demonstrate that UrFound exhibits strong generalization ability and data efficiency when adapting to various tasks in retinal image analysis. By training on ~180k retinal images, UrFound significantly outperforms the state-of-the-art retinal foundation model trained on up to 1.6 million unlabelled images across 8 public retinal datasets. Our code and data are available at https://github.com/yukkai/UrFound.

8/13/2024

ViLReF: A Chinese Vision-Language Retinal Foundation Model

Shengzhu Yang, Jiawei Du, Jia Guo, Weihang Zhang, Hanruo Liu, Huiqi Li, Ningli Wang

Subtle semantic differences in retinal image and text data present great challenges for pre-training visual-language models. Moreover, false negative samples, i.e., image-text pairs having the same semantics but incorrectly regarded as negatives, disrupt the visual-language pre-training process and affect the model's learning ability. This work aims to develop a retinal foundation model, called ViLReF, by pre-training on a paired dataset comprising 451,956 retinal images and corresponding diagnostic text reports. In our vision-language pre-training strategy, we leverage expert knowledge to facilitate the extraction of labels and propose a novel constraint, the Weighted Similarity Coupling Loss, to adjust the speed of pushing sample pairs further apart dynamically within the feature space. Furthermore, we employ a batch expansion module with dynamic memory queues, maintained by momentum encoders, to supply extra samples and compensate for the vacancies caused by eliminating false negatives. Extensive experiments are conducted on multiple datasets for downstream classification and segmentation tasks. The experimental results demonstrate the powerful zero-shot and transfer learning capabilities of ViLReF, verifying the effectiveness of our pre-training strategy. Our ViLReF model is available at: https://github.com/T6Yang/ViLReF.

8/21/2024