Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine

2401.08396

Published 4/24/2024 by Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M. Cheung, Robert Chen, Ronald M. Summers, Justin F. Rousseau, Peiyun Ni, Marc J Landsman and 8 others

cs.CV cs.AI cs.CL

Abstract

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparatively to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians incorrectly answer, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominent in image comprehension (27.2%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.

Overview

Recent studies show that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks, particularly in accurately answering multiple-choice questions.
However, these evaluations focused solely on the accuracy of the answers, without considering the quality of the reasoning or rationales provided by GPT-4V.
This study aims to conduct a more comprehensive analysis of GPT-4V's performance, including its ability to provide coherent explanations, recall medical knowledge, and engage in step-by-step multimodal reasoning when solving NEJM Image Challenges.

Plain English Explanation

The study looks at how well an AI system called GPT-4V, which is trained on both text and images, performs on medical challenge tasks compared to human doctors. Previous research has shown that GPT-4V is more accurate than doctors at answering multiple-choice questions in these challenges. However, this study wanted to look beyond just the final answers and examine how GPT-4V arrives at those answers.

The researchers had GPT-4V try to solve NEJM Image Challenges, which are quizzes designed to test medical professionals' knowledge and diagnostic abilities. They looked at how well GPT-4V could understand the images, recall relevant medical information, and use logical reasoning to work through the steps to arrive at the correct answer.

The results showed that GPT-4V performed about as well as human doctors in terms of getting the multiple-choice answers right. It was even able to correctly answer questions that the doctors had gotten wrong. However, the researchers found that GPT-4V frequently provided flawed or unconvincing explanations for its correct answers, especially when it came to understanding the images. This suggests that while the AI may be good at memorizing medical facts and patterns, it still struggles to truly comprehend the underlying medical concepts in the way that human experts do.

Technical Explanation

The researchers in this study evaluated the performance of GPT-4V, a multimodal AI model that can process both text and images, on NEJM Image Challenges - a set of medical imaging quizzes designed to test the diagnostic capabilities of medical professionals.

Unlike previous studies that looked only at the accuracy of GPT-4V's final multiple-choice answers, this study also examined the rationales behind those answers. The researchers analyzed GPT-4V's ability to:

Understand and reason about the provided medical images (image comprehension)
Recall relevant medical knowledge to apply to the challenge (medical knowledge recall)
Engage in step-by-step logical reasoning to arrive at the final answer (multimodal reasoning)

The results showed that GPT-4V performed comparably to human physicians on multiple-choice accuracy (81.6% vs. 77.8%). GPT-4V also answered correctly in over 78% of the cases that physicians had answered incorrectly.
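The two figures above are an overall accuracy and an accuracy restricted to a subset of cases. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The `Case` record, the per-question `physician_correct` flag, and the toy data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical record; the field names are assumptions, not from the paper.
    model_correct: bool      # did the model pick the right option?
    physician_correct: bool  # did the physician answer count as correct?


def accuracy(cases):
    """Fraction of cases in which the model chose the correct option."""
    if not cases:
        raise ValueError("no cases to score")
    return sum(c.model_correct for c in cases) / len(cases)


def accuracy_where_physicians_wrong(cases):
    """Model accuracy restricted to cases the physicians answered incorrectly."""
    subset = [c for c in cases if not c.physician_correct]
    return accuracy(subset)


# Toy data for illustration only; these are not results from the study.
toy_cases = [
    Case(model_correct=True, physician_correct=True),
    Case(model_correct=True, physician_correct=False),
    Case(model_correct=False, physician_correct=True),
    Case(model_correct=True, physician_correct=False),
    Case(model_correct=False, physician_correct=False),
]

overall = accuracy(toy_cases)
hard_subset = accuracy_where_physicians_wrong(toy_cases)
print(f"overall accuracy: {overall:.3f}")
print(f"accuracy where physicians were wrong: {hard_subset:.3f}")
```

The second number uses a smaller denominator than the first, so the two percentages describe different sets of questions and cannot be compared directly.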

However, the researchers discovered a concerning issue: GPT-4V frequently presented flawed rationales even in cases where its final choice was correct (35.5% of such cases), most prominently in image comprehension (27.2%). This suggests that a correct answer does not guarantee that the model interpreted the image correctly or reasoned soundly about it, something accuracy metrics alone cannot reveal.
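The flawed-rationale figures are rates conditioned on a correct final answer. The sketch below shows one way such rates could be tallied from annotated cases. It assumes each case carries a set of rationale dimensions judged flawed by a reviewer. The `AnnotatedCase` type, the dimension labels, and the toy data are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# Dimension labels are assumptions that mirror the three aspects named above.
DIMENSIONS = ("image_comprehension", "knowledge_recall", "reasoning")


@dataclass
class AnnotatedCase:
    answer_correct: bool  # was the final multiple-choice answer right?
    flawed: frozenset     # dimensions a reviewer marked as flawed


def flawed_rates_among_correct(cases):
    """Share of correctly answered cases with a flawed rationale, overall and per dimension."""
    correct = [c for c in cases if c.answer_correct]
    if not correct:
        raise ValueError("no correctly answered cases")
    n = len(correct)
    rates = {"any": sum(bool(c.flawed) for c in correct) / n}
    for dim in DIMENSIONS:
        rates[dim] = sum(dim in c.flawed for c in correct) / n
    return rates


# Toy annotations for illustration only; these are not results from the study.
toy = [
    AnnotatedCase(True, frozenset()),
    AnnotatedCase(True, frozenset({"image_comprehension"})),
    AnnotatedCase(True, frozenset({"image_comprehension", "reasoning"})),
    AnnotatedCase(True, frozenset()),
    AnnotatedCase(False, frozenset({"knowledge_recall"})),  # excluded: wrong answer
]

rates = flawed_rates_among_correct(toy)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Because a single rationale can be flawed in more than one dimension, the per-dimension rates need not add up to the overall rate.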

Critical Analysis

The researchers in this study provide a nuanced and thoughtful evaluation of GPT-4V's performance on medical challenge tasks. While the model's high accuracy in answering multiple-choice questions is impressive, the findings about its problematic rationales are an important caveat that shouldn't be overlooked.

As the researchers note, further in-depth evaluations of GPT-4V's reasoning capabilities are necessary before such multimodal AI models can be safely integrated into clinical workflows. Simply getting the right answers is not enough - healthcare professionals need to be able to understand and trust the underlying logic that led to those answers.

Additionally, the researchers acknowledge that their study was limited to a specific set of NEJM Image Challenges, and the generalizability of the findings to other medical domains or real-world clinical scenarios remains to be seen. Further research is needed to evaluate GPT-4V's performance across a broader range of medical tasks and settings.

Overall, this study highlights the importance of looking beyond just accuracy metrics when evaluating the capabilities of advanced AI systems, especially in high-stakes domains like healthcare. The researchers have set a strong example of the kind of rigorous, multi-faceted evaluation that should be the standard for these technologies.

Conclusion

This study provides a nuanced and comprehensive evaluation of GPT-4V's performance on medical challenge tasks, going beyond just the accuracy of its final answers to examine the quality of its reasoning and explanations.

The researchers found that while GPT-4V was comparable to human physicians in multiple-choice accuracy (81.6% vs. 77.8%), it frequently presented flawed rationales for its correct answers (35.5% of those cases), particularly when it came to understanding and reasoning about medical images.

These findings underscore the importance of evaluating advanced AI systems like GPT-4V not just on their final outputs, but on the soundness and interpretability of their underlying decision-making processes. As the use of such multimodal AI models in healthcare continues to expand, further research and careful consideration of their strengths and limitations will be crucial to ensure they are deployed safely and responsibly.

Ultimately, this study highlights the potential of GPT-4V and similar models, but also the need for continued development and rigorous testing before they can be confidently integrated into high-stakes clinical workflows. The researchers have set an important example for the AI research community in their comprehensive and thoughtful evaluation approach.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evaluating GPT-4 with Vision on Detection of Radiological Findings on Chest Radiographs

Yiliang Zhou, Hanley Ong, Patrick Kennedy, Carol Wu, Jacob Kazam, Keith Hentel, Adam Flanders, George Shih, Yifan Peng

The study examines the application of GPT-4V, a multi-modal large language model equipped with visual recognition, in detecting radiological findings from a set of 100 chest radiographs and suggests that GPT-4V is currently not ready for real-world diagnostic usage in interpreting chest radiographs.

5/15/2024

eess.IV cs.AI cs.CV

Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI

Gyeong-Geon Lee, Xiaoming Zhai

Educational scholars have analyzed various image data acquired from teaching and learning situations, such as photos that show classroom dynamics, students' drawings with regard to the learning content, textbook illustrations, etc. Unquestioningly, most qualitative analysis of and explanation on image data have been conducted by human researchers, without machine-based automation. It was partially because most image processing artificial intelligence models were not accessible to general educational scholars or explainable due to their complex deep neural network architecture. However, the recent development of Visual Question Answering (VQA) techniques is accomplishing usable visual language models, which receive from the user a question about the given image and return an answer, both in natural language. Particularly, GPT-4V, released by OpenAI, has opened up the state-of-the-art visual language model service so that VQA could be used for a variety of purposes. However, VQA and GPT-4V have not yet been applied to educational studies much. In this position paper, we suggest that GPT-4V contributes to realizing VQA for education. By 'realizing' VQA, we denote two meanings: (1) GPT-4V realizes the utilization of VQA techniques by any educational scholars without technical/accessibility barrier, and (2) GPT-4V makes educational scholars realize the usefulness of VQA to educational research. Given these, this paper aims to introduce VQA for educational studies so that it provides a milestone for educational research methodology. In this paper, chapter II reviews the development of VQA techniques, which primes with the release of GPT-4V. Chapter III reviews the use of image analysis in educational studies. Chapter IV demonstrates how GPT-4V can be used for each research usage reviewed in Chapter III, with operating prompts provided. Finally, chapter V discusses the future implications.

5/14/2024

cs.AI

GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos. Objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of grasping and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in achieving real robots' operations from human demonstrations in a one-shot manner. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline. The prompts of GPT-4V/GPT-4 are available at this project page:

5/7/2024

cs.RO cs.CL cs.CV

Harnessing GPT-4V(ision) for Insurance: A Preliminary Exploration

Chenwei Lin, Hanjia Lyu, Jiebo Luo, Xian Xu

The emergence of Large Multimodal Models (LMMs) marks a significant milestone in the development of artificial intelligence. Insurance, as a vast and complex discipline, involves a wide variety of data forms in its operational processes, including text, images, and videos, thereby giving rise to diverse multimodal tasks. Despite this, there has been limited systematic exploration of multimodal tasks specific to insurance, nor a thorough investigation into how LMMs can address these challenges. In this paper, we explore GPT-4V's capabilities in the insurance domain. We categorize multimodal tasks by focusing primarily on visual aspects based on types of insurance (e.g., auto, household/commercial property, health, and agricultural insurance) and insurance stages (e.g., risk assessment, risk monitoring, and claims processing). Our experiment reveals that GPT-4V exhibits remarkable abilities in insurance-related tasks, demonstrating not only a robust understanding of multimodal content in the insurance domain but also a comprehensive knowledge of insurance scenarios. However, there are notable shortcomings: GPT-4V struggles with detailed risk rating and loss assessment, suffers from hallucination in image understanding, and shows variable support for different languages. Through this work, we aim to bridge the insurance domain with cutting-edge LMM technology, facilitate interdisciplinary exchange and development, and provide a foundation for the continued advancement and evolution of future research endeavors.

4/16/2024

cs.CV cs.AI cs.CL cs.LG