Evaluating GPT-4 with Vision on Detection of Radiological Findings on Chest Radiographs

2403.15528

Published 5/15/2024 by Yiliang Zhou, Hanley Ong, Patrick Kennedy, Carol Wu, Jacob Kazam, Keith Hentel, Adam Flanders, George Shih, Yifan Peng

eess.IV cs.AI cs.CV

Evaluating GPT-4 with Vision on Detection of Radiological Findings on Chest Radiographs

Abstract

The study examines the application of GPT-4V, a multi-modal large language model equipped with visual recognition, in detecting radiological findings from a set of 100 chest radiographs and suggests that GPT-4V is currently not ready for real-world diagnostic usage in interpreting chest radiographs.

Get summaries of the top AI research delivered straight to your inbox:

Overview

The paper evaluates the performance of GPT-V4, a vision-enabled version of GPT-4, on the task of detecting radiological findings in chest X-ray images.
The study aims to assess the potential of large language models with visual capabilities for medical image analysis and decision support.
The researchers used a dataset of chest X-ray images and corresponding radiologist-annotated findings to evaluate GPT-V4's performance.

Plain English Explanation

The paper explores the use of a powerful artificial intelligence (AI) model called GPT-V4 to analyze chest X-ray images and detect any abnormalities or medical findings. GPT-V4 is a more advanced version of the GPT-4 language model, with the added ability to process visual information.

The researchers wanted to see how well GPT-V4 could identify different medical conditions or issues that might be visible in chest X-ray images, such as internal links pneumonia, internal links lung cancer, or internal links heart problems. This could be useful for helping doctors and radiologists analyze X-rays more quickly and accurately, potentially leading to faster diagnoses and better patient care.

The researchers used a dataset of chest X-ray images that had already been analyzed by human experts, who had identified any abnormalities or medical findings in the images. They then gave these X-ray images to the GPT-V4 model and measured how well it was able to detect the same findings that the human experts had identified.

By testing GPT-V4's performance on this task, the researchers hope to better understand the potential of large language models with visual capabilities, like internal links MedPromptX, for assisting in medical image analysis and decision-making.

Technical Explanation

The paper describes an experiment to evaluate the performance of GPT-V4, a multimodal language model that can process both text and images, on the task of detecting radiological findings in chest X-ray images.

The researchers used a dataset of chest X-ray images, each of which had been annotated by radiologists to identify various medical findings, such as internal links pneumonia, lung nodules, or pleural effusions. They then tested the GPT-V4 model by presenting it with these X-ray images and prompting it to detect and describe any relevant radiological findings.

The study design involved several steps:

Data collection: The researchers gathered a dataset of chest X-ray images with corresponding radiologist annotations.
Model evaluation: They assessed the performance of GPT-V4 on the task of detecting radiological findings in the X-ray images, comparing its output to the ground truth annotations provided by the radiologists.
Qualitative analysis: The researchers also conducted a qualitative analysis, examining the types of findings GPT-V4 was able to detect and the quality of its descriptions.

The results of the study provide insights into the potential of large language models with visual capabilities, such as GPT-V4, to assist in medical image analysis and decision support. The findings suggest that these models may be able to accurately detect a range of radiological findings and could potentially augment or enhance the work of human radiologists.

Critical Analysis

The paper provides a promising initial evaluation of GPT-V4's performance on the task of detecting radiological findings in chest X-ray images. However, the researchers acknowledge several limitations and areas for further research:

Dataset size and diversity: The dataset used in the study was relatively small and may not have captured the full range of radiological findings or patient demographics encountered in clinical practice. Expanding the dataset could help better assess the model's generalizability.
Comparison to human radiologists: While the study compared GPT-V4's performance to the ground truth annotations provided by radiologists, it did not directly compare the model's performance to that of human radiologists. Conducting such a comparison would be valuable to understand the relative strengths and weaknesses of the model.
Real-world clinical integration: The paper focused on the model's technical performance, but it did not address the practical challenges of integrating such a system into clinical workflows and decision-making processes. Further research is needed to understand the feasibility and potential impact of deploying GPT-V4 or similar models in real-world medical settings.
Ethical considerations: The use of AI-powered systems in medical diagnosis raises important ethical questions, such as issues of transparency, accountability, and the potential for bias. The paper does not address these considerations, which should be an important focus of future research in this area.

Overall, the paper presents promising results and highlights the potential of large language models with visual capabilities for assisting in medical image analysis. However, more research is needed to fully understand the practical implications and limitations of these models in real-world clinical settings.

Conclusion

The paper evaluates the performance of GPT-V4, a multimodal language model that can process both text and images, on the task of detecting radiological findings in chest X-ray images. The study suggests that GPT-V4 can accurately identify a range of medical conditions and abnormalities, potentially augmenting the work of human radiologists and improving the speed and accuracy of medical image analysis.

While the results are promising, the researchers acknowledge several limitations and areas for further research, such as the need for larger and more diverse datasets, direct comparisons to human radiologists, and a deeper consideration of the practical and ethical implications of deploying such models in clinical settings. Nonetheless, this study contributes to the growing body of research exploring the potential of large language models with visual capabilities for medical image analysis and decision support.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🎯

Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine

Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M. Cheung, Robert Chen, Ronald M. Summers, Justin F. Rousseau, Peiyun Ni, Marc J Landsman, Sally L. Baxter, Subhi J. Al'Aref, Yijia Li, Alex Chen, Josef A. Brejt, Michael F. Chiang, Yifan Peng, Zhiyong Lu

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparatively to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians incorrectly answer, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominent in image comprehension (27.2%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.

4/24/2024

cs.CV cs.AI cs.CL

Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation

Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungren, Serena Yeung-Levy, Curtis P. Langlotz, Sheng Wang, Hoifung Poon

The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such models in biomedicine. However, despite early promising results on some biomedical benchmarks, there are still major challenges that need to be addressed before these models can be used in real-world clinics. Frontier general-domain models such as GPT-4V still have significant performance gaps in multimodal biomedical applications. More importantly, less-acknowledged pragmatic issues, including accessibility, model cost, and tedious manual evaluation make it hard for clinicians to use state-of-the-art large models directly on private patient data. Here, we explore training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology. To maximize data efficiency, we adopt a modular approach by incorporating state-of-the-art pre-trained models for image and text modalities, and focusing on training a lightweight adapter to ground each modality to the text embedding space, as exemplified by LLaVA-Med. For training, we assemble a large dataset of over 697 thousand radiology image-text pairs. For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation. For best practice, we conduct a systematic ablation study on various choices in data engineering and multimodal training. The resulting LlaVA-Rad (7B) model attains state-of-the-art results on standard radiology tasks such as report generation and cross-modal retrieval, even outperforming much larger models such as GPT-4V and Med-PaLM M (84B). The inference of LlaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.

5/14/2024

cs.CL cs.CV

👀

Pixels and Predictions: Potential of GPT-4V in Meteorological Imagery Analysis and Forecast Communication

John R. Lawson, Montgomery L. Flora, Kevin H. Goebbert, Seth N. Lyman, Corey K. Potvin, David M. Schultz, Adam J. Stepanek, Joseph E. Trujillo-Falc'on

Generative AI, such as OpenAI's GPT-4V large-language model, has rapidly entered mainstream discourse. Novel capabilities in image processing and natural-language communication may augment existing forecasting methods. Large language models further display potential to better communicate weather hazards in a style honed for diverse communities and different languages. This study evaluates GPT-4V's ability to interpret meteorological charts and communicate weather hazards appropriately to the user, despite challenges of hallucinations, where generative AI delivers coherent, confident, but incorrect responses. We assess GPT-4V's competence via its web interface ChatGPT in two tasks: (1) generating a severe-weather outlook from weather-chart analysis and conducting self-evaluation, revealing an outlook that corresponds well with a Storm Prediction Center human-issued forecast; and (2) producing hazard summaries in Spanish and English from weather charts. Responses in Spanish, however, resemble direct (not idiomatic) translations from English to Spanish, yielding poorly translated summaries that lose critical idiomatic precision required for optimal communication. Our findings advocate for cautious integration of tools like GPT-4V in meteorology, underscoring the necessity of human oversight and development of trustworthy, explainable AI.

4/24/2024

cs.CL cs.AI

🗣️

Letter to the Editor: What are the legal and ethical considerations of submitting radiology reports to ChatGPT?

Siddharth Agarwal, David Wood, Robin Carpenter, Yiran Wei, Marc Modat, Thomas C Booth

This letter critically examines the recent article by Infante et al. assessing the utility of large language models (LLMs) like GPT-4, Perplexity, and Bard in identifying urgent findings in emergency radiology reports. While acknowledging the potential of LLMs in generating labels for computer vision, concerns are raised about the ethical implications of using patient data without explicit approval, highlighting the necessity of stringent data protection measures under GDPR.

5/10/2024

cs.CV