MedExQA: Medical Question Answering Benchmark with Multiple Explanations

arXiv:2406.06331

Published 6/11/2024 by Yunsoo Kim, Jinge Wu, Yusuf Abdulle, Honghan Wu

Abstract

This paper introduces MedExQA, a novel benchmark in medical question-answering, to evaluate large language models' (LLMs) understanding of medical knowledge through explanations. By constructing datasets across five distinct medical specialties that are underrepresented in current datasets and further incorporating multiple explanations for each question-answer pair, we address a major gap in current medical QA benchmarks which is the absence of comprehensive assessments of LLMs' ability to generate nuanced medical explanations. Our work highlights the importance of explainability in medical LLMs, proposes an effective methodology for evaluating models beyond classification accuracy, and sheds light on one specific domain, speech language pathology, where current LLMs including GPT4 lack good understanding. Our results show generation evaluation with multiple explanations aligns better with human assessment, highlighting an opportunity for a more robust automated comprehension assessment for LLMs. To diversify open-source medical LLMs (currently mostly based on Llama2), this work also proposes a new medical model, MedPhi-2, based on Phi-2 (2.7B). The model outperformed medical LLMs based on Llama2-70B in generating explanations, showing its effectiveness in the resource-constrained medical domain. We will share our benchmark datasets and the trained model.

Overview

This paper introduces MedExQA, a new benchmark for evaluating the medical knowledge and reasoning capabilities of large language models.
MedExQA consists of multiple-choice questions across five medical specialties that are underrepresented in existing datasets, with each question-answer pair accompanied by multiple explanations.
The benchmark aims to assess not just the models' ability to answer questions correctly, but also their capacity to provide coherent and informative explanations for their answers.

Plain English Explanation

MedExQA is a new tool for testing how well large language models, such as GPT-4, understand and can explain medical information. The researchers created a set of multiple-choice questions covering five medical specialties, and for each question, they also provided several explanations for the correct answer.

The goal of this benchmark is to go beyond just checking if the models can select the right answer. It also tests whether the models can provide clear and informative explanations for their choices. This is important because in many real-world applications, being able to explain the reasoning behind an answer is just as crucial as getting the answer right.

By assessing both the accuracy and the quality of explanations, MedExQA aims to give a more comprehensive evaluation of a model's medical knowledge and its ability to apply that knowledge effectively. This could be useful for developing more capable and trustworthy AI systems for medical applications, where having a clear understanding of the model's reasoning is essential.

Technical Explanation

MedExQA: Medical Question Answering Benchmark with Multiple Explanations introduces a new dataset and evaluation framework for assessing the medical knowledge and reasoning capabilities of large language models. The dataset consists of multiple-choice questions on various medical topics, with each question accompanied by several plausible explanations for the correct answer.
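To make the dataset description above concrete, here is a minimal Python sketch of how one such item and the classification-accuracy half of the evaluation might be represented. The class name `MedExQAItem`, its field names, and the toy question are illustrative assumptions, not the schema or content of the released dataset.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MedExQAItem:
    """Hypothetical record layout; the real dataset's fields may differ."""
    question: str
    options: Dict[str, str]   # option letter -> option text
    answer: str               # letter of the correct option
    explanations: List[str]   # several reference explanations per item


def accuracy(items: List[MedExQAItem], predictions: List[str]) -> float:
    """Fraction of items whose predicted option letter matches the answer key."""
    if not items:
        return 0.0
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.answer)
    return correct / len(items)


# Invented toy item for illustration only; it is NOT taken from MedExQA.
toy_item = MedExQAItem(
    question="Which vitamin deficiency causes scurvy?",
    options={"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
    answer="B",
    explanations=[
        "Scurvy results from a lack of vitamin C.",
        "Vitamin C is needed for collagen synthesis; without it, scurvy develops.",
    ],
)
```

Keeping the reference explanations as a list on each item means a generated explanation can later be compared against every reference rather than a single one.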

The researchers designed MedExQA to go beyond simply measuring a model's ability to select the right answer. The benchmark also evaluates the model's capacity to provide coherent and informative explanations for its choices. This is an important aspect of medical decision-making, where being able to understand the reasoning behind a recommendation is crucial for building trust and ensuring appropriate application of the model's outputs.

To create the dataset, the authors curated questions and explanations from medical textbooks, journals, and online resources. They carefully constructed the multiple-choice options and explanations to ensure that the task requires genuine medical knowledge and reasoning, rather than just superficial pattern matching.

The researchers then benchmarked several state-of-the-art large language models on the MedExQA dataset, assessing both their question-answering accuracy and the quality of their generated explanations. They found that while the models perform reasonably well on the question-answering task, their explanation generation still has room for improvement; according to the abstract, speech language pathology stands out as a specialty where current LLMs, including GPT-4, lack good understanding.
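One way to picture scoring a generated explanation against multiple reference explanations is sketched below in Python. This summary does not say which generation metrics the paper uses, so the token-overlap F1 here is a stand-in, and taking the maximum over references is an assumed aggregation rule; the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a stand-in metric, not the paper's)."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(candidate: str, references: List[str]) -> float:
    """Best match against any reference (assumed rule: maximum over references)."""
    return max((token_f1(candidate, ref) for ref in references), default=0.0)
```

With several references, a correct explanation that is phrased differently from one reference can still score well against another, which is one plausible reason a multi-reference setup could track human judgment more closely.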

Critical Analysis

The MedExQA: Medical Question Answering Benchmark with Multiple Explanations paper presents a valuable contribution to the field of medical AI research. By introducing a benchmark that goes beyond just measuring question-answering accuracy, the authors have highlighted the importance of explanation quality in developing trustworthy and interpretable medical AI systems.

One potential limitation of the study is the size and diversity of the MedExQA dataset. While the authors have made efforts to curate a broad range of medical topics, the dataset may not be comprehensive enough to fully capture the breadth of medical knowledge required in real-world applications. Expanding the dataset with more questions and explanations, potentially in multiple languages, could help to further strengthen the benchmark.

Additionally, the paper does not provide a detailed analysis of the types of errors or weaknesses observed in the models' explanations. A deeper investigation into the common flaws or biases in the models' reasoning could help guide future research and development efforts to improve explanation quality.

Two related works also explore the challenges of evaluating medical knowledge in large language models: "MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering" and "Multiple Choice Questions for Large Language Models: A Case Study on the Medical Domain". Considering the insights from these studies could further enrich the critical analysis of the MedExQA benchmark.

Conclusion

MedExQA: Medical Question Answering Benchmark with Multiple Explanations represents an important step forward in the development of more comprehensive and trustworthy medical AI systems. By focusing not just on question-answering accuracy but also on the quality of model explanations, the benchmark highlights the crucial role of interpretability and transparency in medical decision-making.

The findings from this study could help guide the future development of large language models for medical applications, pushing researchers to prioritize the improvement of explanation generation capabilities alongside traditional performance metrics. Ultimately, this could lead to the creation of AI systems that are not only knowledgeable but also able to effectively communicate their reasoning, fostering greater trust and better-informed decision-making in the healthcare domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

Iñigo Alonso, Maite Oronoz, Rodrigo Agerri

Large Language Models (LLMs) have the potential of facilitating the development of Artificial Intelligence technology to assist medical experts for interactive decision support, which has been demonstrated by their competitive performances in Medical QA. However, while impressive, the required quality bar for medical applications remains far from being achieved. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks to assess medical knowledge lack reference gold explanations which means that it is not possible to evaluate the reasoning of LLMs predictions. Finally, the situation is particularly grim if we consider benchmarking LLMs for languages other than English which remains, as far as we know, a totally neglected topic. In order to address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA includes for the first time reference gold explanations written by medical doctors which can be leveraged to establish various gold-based upper-bounds for comparison with LLMs performance. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches show that performance of LLMs still has large room for improvement, especially for languages other than English. Furthermore, and despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge that may positively impact results on downstream evaluations for Medical Question Answering. So far the benchmark is available in four languages, but we hope that this work may encourage further development to other languages.

4/9/2024

cs.CL

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions

Hanjie Chen, Zhouxiang Fang, Yash Singla, Mark Dredze

LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exam or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. Human and automatic evaluations of model-generated explanations provide insights into the promise and deficiency of LLMs for explainable medical QA.

6/27/2024

cs.CL

MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering

Juraj Vladika, Phillip Schneider, Florian Matthes

In recent years, Large Language Models (LLMs) have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora. They can leverage this knowledge for downstream tasks like question answering (QA), even in complex areas involving health topics. Considering their high potential for facilitating clinical work in the future, understanding the quality of encoded medical knowledge and its recall in LLMs is an important step forward. In this study, we examine the capability of LLMs to exhibit medical knowledge recall by constructing a novel dataset derived from systematic reviews -- studies synthesizing evidence-based answers for specific medical questions. Through experiments on the new MedREQAL dataset, comprising question-answer pairs extracted from rigorous systematic reviews, we assess six LLMs, such as GPT and Mixtral, analyzing their classification and generation performance. Our experimental insights into LLM performance on the novel biomedical QA dataset reveal the still challenging nature of this task.

6/11/2024

cs.CL

MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering

Robert Osazuwa Ness, Katie Matton, Hayden Helm, Sheng Zhang, Junaid Bajwa, Carey E. Priebe, Eric Horvitz

Large language models (LLM) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply that the performance generalizes to real-world clinical settings. Medical question-answering benchmarks rely on assumptions consistent with quantifying LLM performance but that may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that can help the LLM generalize to practical conditions regardless of unrealistic assumptions in celebrated benchmarks. We seek to quantify how well LLM medical question-answering benchmark performance generalizes when benchmark assumptions are violated. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting strong assumptions about patient characteristics presented in the MedQA benchmark. Successful attacks modify a benchmark item in ways that would be unlikely to fool a medical expert but nonetheless trick the LLM into changing from a correct to an incorrect answer. Further, we present a permutation test technique that can ensure a successful attack is statistically significant. We show how to use performance on a MedFuzzed benchmark, as well as individual successful attacks. The methods show promise at providing insights into the ability of an LLM to operate robustly in more realistic settings.

6/12/2024

cs.CL cs.LG