Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions

2402.18060

Published 6/27/2024 by Hanjie Chen, Zhouxiang Fang, Yash Singla, Mark Dredze

Abstract

LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exam or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. Human and automatic evaluations of model-generated explanations provide insights into the promise and deficiency of LLMs for explainable medical QA.

Overview

This paper benchmarks the performance of large language models (LLMs) on answering and explaining challenging medical questions.
The researchers constructed two medical datasets, JAMA Clinical Challenge and Medbullets, and evaluated seven LLMs on them using various prompts.
The experiments assessed the models' ability to provide accurate and informative answers, as well as explanations for their responses.
The findings offer insights into the current state of medical knowledge representation and reasoning in LLMs, and identify areas for future improvement.

Plain English Explanation

The paper looks at how well large language models, which are AI systems trained on massive amounts of text data, can answer and explain complex medical questions. The researchers used two medical datasets to test the models' capabilities.

The first dataset, called JAMA Clinical Challenge, consists of questions based on challenging clinical cases. The second dataset, Medbullets, comprises simulated clinical questions. Both are multiple-choice and come with expert-written explanations. The researchers wanted to see how well the language models could provide accurate and informative answers to these challenging medical questions, as well as explain their reasoning.

The findings offer insights into the current limitations of these large language models when it comes to medical knowledge and reasoning. While the models can often provide reasonable answers, they still struggle to fully explain their thought processes and the underlying medical concepts. This suggests there is room for improvement in how these models represent and apply medical knowledge.

By benchmarking the performance of these language models, the researchers hope to guide future developments in making AI systems that can better assist healthcare providers and patients with medical decision-making.

Technical Explanation

The paper evaluates seven LLMs on two newly constructed medical question-answering datasets. The first dataset, JAMA Clinical Challenge, consists of questions based on challenging clinical cases. The second dataset, Medbullets, comprises simulated clinical questions. Both are structured as multiple-choice tasks accompanied by expert-written explanations, and the experiments indicate that they are harder than previous benchmarks.
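
Because the abstract describes both datasets as multiple-choice tasks, answer quality can be scored as plain accuracy over option letters. The following is a minimal sketch of that idea, not the authors' evaluation code: the `MCQuestion` schema, the `extract_choice` heuristic, and the "answer is X" response format are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class MCQuestion:
    """One multiple-choice item (hypothetical schema, not the datasets' real format)."""
    question: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # gold option letter
    explanation: str         # expert-written reference explanation


# Matches phrases such as "answer is B", "Answer: (C)", "answer B" (assumed format).
_ANSWER_PATTERN = re.compile(
    r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-Z])\)?(?![A-Za-z])"
)


def extract_choice(response: str, valid_letters: Set[str]) -> Optional[str]:
    """Return the first valid option letter stated in the response, else None."""
    for match in _ANSWER_PATTERN.finditer(response):
        letter = match.group(1)
        if letter in valid_letters:
            return letter
    return None


def accuracy(questions: List[MCQuestion], responses: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold answer."""
    if len(questions) != len(responses):
        raise ValueError("questions and responses must have the same length")
    if not questions:
        return 0.0
    correct = sum(
        extract_choice(resp, set(q.options)) == q.answer
        for q, resp in zip(questions, responses)
    )
    return correct / len(questions)
```

Responses from which no valid letter can be extracted count as incorrect here. A real evaluation would need a more careful parsing rule for free-form model output.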

The researchers assessed the models' ability to provide accurate answers to the questions, as well as their capacity to generate informative explanations for their responses. They used both automatic evaluation metrics and human judgments to assess the quality of the answers and explanations.
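
This summary does not name the automatic metrics used, so the sketch below uses a simple token-overlap F1 between a model-generated explanation and the expert-written reference as a stand-in. It is an assumption chosen for illustration, not the paper's metric, and the `tokenize` and `overlap_f1` names are hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_f1(candidate: str, reference: str) -> float:
    """Token-level F1 between a generated explanation and a reference one."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Multiset intersection: each shared token counts at most min(cand, ref) times.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Surface-overlap scores of this kind reward shared wording rather than correct reasoning, which is one reason human judgments are reported alongside automatic ones.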

The results show that while the language models can often provide reasonable answers to the medical questions, they struggle to fully explain their reasoning in a way that demonstrates a deep understanding of the underlying medical concepts. The models tend to produce generic, superficial explanations that do not always align with the nuances of the medical domain.

These findings suggest that current large language models, despite their impressive capabilities, still have significant limitations when it comes to representing and reasoning about complex medical knowledge. The researchers argue that further advancements in areas like medical knowledge representation, reasoning, and explainability will be necessary to develop AI systems that can truly assist healthcare providers and patients in medical decision-making.

Critical Analysis

The paper provides a valuable benchmark for assessing the current state of large language models in the medical domain, but it also highlights several important caveats and limitations that should be considered.

One key limitation is the relatively small size of the datasets used for evaluation, which may not fully capture the breadth and complexity of real-world medical knowledge. The researchers acknowledge that expanding the datasets, particularly with more diverse and challenging questions, could lead to different insights about the models' capabilities.

Additionally, the paper focuses primarily on the models' ability to provide accurate answers and explanations, but does not delve deeply into other clinically relevant factors, such as the models' sensitivity to nuances in medical terminology, their handling of ambiguity or uncertainty, or their ability to consider patient-specific factors in making recommendations.

Further research is also needed to better understand the specific shortcomings of the language models, such as the types of medical knowledge they struggle to represent or the reasoning patterns they fail to capture. Addressing these limitations will be crucial for developing AI systems that can truly complement and augment human medical expertise.

Conclusion

This paper provides a valuable benchmark for assessing the performance of large language models on medical question answering and explanation tasks. The findings suggest that while these models can often provide reasonable answers, they still struggle to fully capture and explain the nuances of medical knowledge and reasoning.

By highlighting the current limitations of language models in the medical domain, the researchers hope to guide future research and development efforts towards creating AI systems that can better assist healthcare providers and patients. Advancements in areas like medical knowledge representation, reasoning, and explainability will be crucial for bridging the gap between the capabilities of these models and the demands of real-world medical practice.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MedExQA: Medical Question Answering Benchmark with Multiple Explanations

Yunsoo Kim, Jinge Wu, Yusuf Abdulle, Honghan Wu

This paper introduces MedExQA, a novel benchmark in medical question-answering, to evaluate large language models' (LLMs) understanding of medical knowledge through explanations. By constructing datasets across five distinct medical specialties that are underrepresented in current datasets and further incorporating multiple explanations for each question-answer pair, we address a major gap in current medical QA benchmarks which is the absence of comprehensive assessments of LLMs' ability to generate nuanced medical explanations. Our work highlights the importance of explainability in medical LLMs, proposes an effective methodology for evaluating models beyond classification accuracy, and sheds light on one specific domain, speech language pathology, where current LLMs including GPT4 lack good understanding. Our results show generation evaluation with multiple explanations aligns better with human assessment, highlighting an opportunity for a more robust automated comprehension assessment for LLMs. To diversify open-source medical LLMs (currently mostly based on Llama2), this work also proposes a new medical model, MedPhi-2, based on Phi-2 (2.7B). The model outperformed medical LLMs based on Llama2-70B in generating explanations, showing its effectiveness in the resource-constrained medical domain. We will share our benchmark datasets and the trained model.

6/11/2024

cs.CL cs.AI

MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

Iñigo Alonso, Maite Oronoz, Rodrigo Agerri

Large Language Models (LLMs) have the potential of facilitating the development of Artificial Intelligence technology to assist medical experts for interactive decision support, which has been demonstrated by their competitive performances in Medical QA. However, while impressive, the required quality bar for medical applications remains far from being achieved. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks to assess medical knowledge lack reference gold explanations which means that it is not possible to evaluate the reasoning of LLMs predictions. Finally, the situation is particularly grim if we consider benchmarking LLMs for languages other than English which remains, as far as we know, a totally neglected topic. In order to address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA includes for the first time reference gold explanations written by medical doctors which can be leveraged to establish various gold-based upper-bounds for comparison with LLMs performance. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches show that performance of LLMs still has large room for improvement, especially for languages other than English. Furthermore, and despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge that may positively impact results on downstream evaluations for Medical Question Answering. So far the benchmark is available in four languages, but we hope that this work may encourage further development to other languages.

4/9/2024

cs.CL

Large Language Models in Healthcare: A Comprehensive Benchmark

Andrew Liu, Hongjian Zhou, Yining Hua, Omid Rohanian, Anshul Thakur, Lei Clifton, David A. Clifton

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long document) summarization, patient education, pharmacology QA and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.

6/27/2024

cs.CL cs.AI

A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations

Jinqiang Wang, Huansheng Ning, Yi Peng, Qikai Wei, Daniel Tesfai, Wenwei Mao, Tao Zhu, Runhe Huang

Large Language Models (LLMs) have demonstrated surprising performance across various natural language processing tasks. Recently, medical LLMs enhanced with domain-specific knowledge have exhibited excellent capabilities in medical consultation and diagnosis. These models can smoothly simulate doctor-patient dialogues and provide professional medical advice. Most medical LLMs are developed through continued training of open-source general LLMs, which requires significantly fewer computational resources than training LLMs from scratch. Additionally, this approach offers better protection of patient privacy compared to API-based solutions. This survey systematically explores how to train medical LLMs based on general LLMs. It covers: (a) how to acquire a training corpus and construct customized medical training sets, (b) how to choose an appropriate training paradigm, (c) how to choose a suitable evaluation benchmark, and (d) existing challenges and promising future research directions. This survey can provide guidance for the development of LLMs focused on various medical applications, such as medical education, diagnostic planning, and clinical assistants.

6/18/2024

cs.CL cs.AI