Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine

Read original: arXiv:2401.08396 - Published 9/4/2024 by Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M. Cheung, Robert Chen, Ronald M. Summers, Justin F. Rousseau, Peiyun Ni, Marc J Landsman and 8 others

Overview

Recent studies show that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks, particularly in accurately answering multiple-choice questions.
However, these evaluations focused primarily on the accuracy of the answers, without considering the quality of the reasoning or rationales provided by GPT-4V.
This study conducts a more comprehensive analysis of GPT-4V's rationales, covering image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning, when solving New England Journal of Medicine (NEJM) Image Challenges.

Plain English Explanation

The study looks at how well an AI system called GPT-4V, which is trained on both text and images, performs on medical challenge tasks compared to human doctors. Previous research has shown that GPT-4V is more accurate than doctors at answering multiple-choice questions in these challenges. However, this study wanted to look beyond just the final answers and examine how GPT-4V arrives at those answers.

The researchers had GPT-4V try to solve NEJM Image Challenges, which are quizzes designed to test medical professionals' knowledge and diagnostic abilities. They looked at how well GPT-4V could understand the images, recall relevant medical information, and use logical reasoning to work through the steps to arrive at the correct answer.

The results showed that GPT-4V performed about as well as human doctors in terms of getting the multiple-choice answers right. It also answered correctly on many of the questions that the doctors had gotten wrong. However, the researchers found that GPT-4V frequently gave flawed explanations even when its final answer was correct, especially when it came to understanding the images. This suggests that a correct answer does not always mean the AI understood the image, or the medicine behind it, in the way that a human expert would.

Technical Explanation

The researchers in this study evaluated the performance of GPT-4V, a multimodal AI model that can process both text and images, on NEJM Image Challenges - a set of medical imaging quizzes designed to test the diagnostic capabilities of medical professionals.

Unlike previous studies, which looked primarily at the accuracy of GPT-4V's final multiple-choice answers, this research took a more comprehensive approach. The researchers analyzed GPT-4V's ability to:

Understand and reason about the provided medical images (image comprehension)
Recall relevant medical knowledge to apply to the challenge (medical knowledge recall)
Engage in step-by-step logical reasoning to arrive at the final answer (multimodal reasoning)

The results showed that GPT-4V performed comparably to human physicians on multiple-choice accuracy (81.6% vs. 77.8%). GPT-4V also answered correctly in over 78% of the cases that the human physicians had gotten wrong.
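
To make the two kinds of accuracy figure concrete, here is a minimal sketch of how overall accuracy and accuracy on the physician-missed subset could be computed from per-case results. The `CaseResult` schema, the function names, and the toy data are all hypothetical and are not taken from the paper. The sketch also assumes a single correct/incorrect flag per case for physicians, which may differ from how the paper aggregates physician responses.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One quiz case (hypothetical schema, not from the paper)."""
    model_correct: bool
    physician_correct: bool


def accuracy(flags):
    """Return the fraction of True values, or None if there are no values."""
    flags = list(flags)
    if not flags:
        return None
    return sum(flags) / len(flags)


def summarize(cases):
    """Compute overall accuracies and model accuracy where physicians were wrong."""
    physician_missed = [c for c in cases if not c.physician_correct]
    return {
        "model": accuracy(c.model_correct for c in cases),
        "physician": accuracy(c.physician_correct for c in cases),
        "model_when_physician_wrong": accuracy(
            c.model_correct for c in physician_missed
        ),
    }


# Toy data for illustration only; these are not the study's results.
toy_cases = [
    CaseResult(model_correct=True, physician_correct=True),
    CaseResult(model_correct=True, physician_correct=False),
    CaseResult(model_correct=False, physician_correct=True),
    CaseResult(model_correct=True, physician_correct=False),
    CaseResult(model_correct=False, physician_correct=False),
]
summary = summarize(toy_cases)
print(summary)
```

The third number is computed over a different denominator (only the cases physicians missed), so it cannot be compared directly with the two overall accuracies.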

However, the researchers discovered a concerning issue - GPT-4V frequently presented flawed rationales in cases where its final choice was correct (35.5% of such cases), most prominently in image comprehension (27.2%). This suggests that a correct final answer does not guarantee that the model understood the image or the underlying medical concepts in the way that human experts do.
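
The flawed-rationale figures are rates among correctly answered cases, broken down by evaluation dimension. The sketch below shows one way such rates could be tallied from expert reviews. The `RationaleReview` schema, the dimension labels, and the toy reviews are assumptions made for illustration, not the paper's annotation format or data.

```python
from dataclasses import dataclass

# Hypothetical labels for the three rationale dimensions described above.
DIMENSIONS = ("image_comprehension", "knowledge_recall", "reasoning")


@dataclass
class RationaleReview:
    """Expert review of one model answer (hypothetical schema)."""
    answer_correct: bool
    image_comprehension_ok: bool
    knowledge_recall_ok: bool
    reasoning_ok: bool

    def flawed_dimensions(self):
        """Return the names of the dimensions marked as flawed."""
        return [d for d in DIMENSIONS if not getattr(self, f"{d}_ok")]


def flawed_rates_among_correct(reviews):
    """Share of correctly answered cases with a flawed rationale, per dimension."""
    correct = [r for r in reviews if r.answer_correct]
    if not correct:
        return {}
    n = len(correct)
    # "any" counts cases with at least one flawed dimension.
    rates = {"any": sum(bool(r.flawed_dimensions()) for r in correct) / n}
    for d in DIMENSIONS:
        rates[d] = sum(d in r.flawed_dimensions() for r in correct) / n
    return rates


# Toy reviews for illustration only; these are not the study's data.
toy_reviews = [
    RationaleReview(True, True, True, True),
    RationaleReview(True, False, True, True),
    RationaleReview(True, False, False, True),
    RationaleReview(True, True, True, True),
    RationaleReview(False, False, False, False),  # wrong answer, excluded
]
rates = flawed_rates_among_correct(toy_reviews)
print(rates)
```

Because one case can be flawed in several dimensions at once, the per-dimension rates need not add up to the "any" rate.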

Critical Analysis

The researchers in this study provide a nuanced and thoughtful evaluation of GPT-4V's performance on medical challenge tasks. While the model's high accuracy in answering multiple-choice questions is impressive, the findings about its problematic rationales are an important caveat that shouldn't be overlooked.

As the researchers note, further in-depth evaluations of GPT-4V's reasoning capabilities are necessary before such multimodal AI models can be safely integrated into clinical workflows. Simply getting the right answers is not enough - healthcare professionals need to be able to understand and trust the underlying logic that led to those answers.

Additionally, the researchers acknowledge that their study was limited to a specific set of NEJM Image Challenges, and the generalizability of the findings to other medical domains or real-world clinical scenarios remains to be seen. Further research is needed to evaluate GPT-4V's performance across a broader range of medical tasks and settings.

Overall, this study highlights the importance of looking beyond just accuracy metrics when evaluating the capabilities of advanced AI systems, especially in high-stakes domains like healthcare. The researchers have set a strong example of the kind of rigorous, multi-faceted evaluation that should be the standard for these technologies.

Conclusion

This study provides a nuanced and comprehensive evaluation of GPT-4V's performance on medical challenge tasks, going beyond just the accuracy of its final answers to examine the quality of its reasoning and explanations.

The researchers found that while GPT-4V performed comparably to human physicians on multiple-choice accuracy (81.6% vs. 77.8%), it frequently presented flawed rationales even when its final answer was correct, particularly when it came to understanding and reasoning about medical images.

These findings underscore the importance of evaluating advanced AI systems like GPT-4V not just on their final outputs, but on the soundness and interpretability of their underlying decision-making processes. As the use of such multimodal AI models in healthcare continues to expand, further research and careful consideration of their strengths and limitations will be crucial to ensure they are deployed safely and responsibly.

Ultimately, this study highlights the potential of GPT-4V and similar models, but also the need for continued development and rigorous testing before they can be confidently integrated into high-stakes clinical workflows. The researchers have set an important example for the AI research community in their comprehensive and thoughtful evaluation approach.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine

Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M. Cheung, Robert Chen, Ronald M. Summers, Justin F. Rousseau, Peiyun Ni, Marc J Landsman, Sally L. Baxter, Subhi J. Al'Aref, Yijia Li, Alex Chen, Josef A. Brejt, Michael F. Chiang, Yifan Peng, Zhiyong Lu

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparatively to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians incorrectly answer, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominent in image comprehension (27.2%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.

9/4/2024

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang

The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

9/17/2024

GPT-4V Cannot Generate Radiology Reports Yet

Yuyang Jiang, Chacha Chen, Dang Nguyen, Benjamin M. Mervak, Chenhao Tan

GPT-4V's purported strong multimodal abilities raise interests in using it to automate radiology report writing, but there lacks thorough evaluations. In this work, we perform a systematic evaluation of GPT-4V in generating radiology reports on two chest X-ray report datasets: MIMIC-CXR and IU X-Ray. We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it fails terribly in both lexical metrics and clinical efficacy metrics. To understand the low performance, we decompose the task into two steps: 1) the medical image reasoning step of predicting medical condition labels from images; and 2) the report synthesis step of generating reports from (groundtruth) conditions. We show that GPT-4V's performance in image reasoning is consistently low across different prompts. In fact, the distributions of model-predicted labels remain constant regardless of which groundtruth conditions are present on the image, suggesting that the model is not interpreting chest X-rays meaningfully. Even when given groundtruth conditions in report synthesis, its generated reports are less correct and less natural-sounding than a finetuned LLaMA-2. Altogether, our findings cast doubt on the viability of using GPT-4V in a radiology workflow.

7/18/2024

Evaluating GPT-4 with Vision on Detection of Radiological Findings on Chest Radiographs

Yiliang Zhou, Hanley Ong, Patrick Kennedy, Carol Wu, Jacob Kazam, Keith Hentel, Adam Flanders, George Shih, Yifan Peng

The study examines the application of GPT-4V, a multi-modal large language model equipped with visual recognition, in detecting radiological findings from a set of 100 chest radiographs and suggests that GPT-4V is currently not ready for real-world diagnostic usage in interpreting chest radiographs.

5/15/2024