AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

2402.09742

Published 7/1/2024 by Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xie, Fei Huang, Jingren Zhou

Abstract

Artificial intelligence has significantly advanced healthcare, particularly through large language models (LLMs) that excel in medical question answering benchmarks. However, their real-world clinical application remains limited due to the complexities of doctor-patient interactions. To address this, we introduce AI Hospital, a multi-agent framework simulating dynamic medical interactions between Doctor as player and NPCs including Patient, Examiner, and Chief Physician. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and NPCs to evaluate LLMs' performance in symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions. Despite improvements, current LLMs exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. Our findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. Our data, code, and experimental results are all open-sourced at https://github.com/LibertFan/AI_Hospital.

Overview

Researchers developed a multi-agent framework called "AI Hospital" to simulate dynamic medical interactions between a "Doctor" player and non-player characters (NPCs) like "Patient," "Examiner," and "Chief Physician."
This setup allows for realistic assessments of large language models (LLMs) in clinical scenarios through the Multi-View Medical Evaluation (MVME) benchmark.
The benchmark utilizes high-quality Chinese medical records and NPCs to evaluate LLMs' performance in symptom collection, examination recommendations, and diagnoses.
A dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions.

Plain English Explanation

Artificial intelligence (AI) has made significant advancements in healthcare, particularly through the use of large language models (LLMs) that excel at answering medical questions. However, these models have yet to see widespread real-world clinical application due to the complexities of doctor-patient interactions.

To address this, researchers created a simulation environment called "AI Hospital" that replicates the dynamic interactions between a "Doctor" player and various non-player characters, such as a "Patient," "Examiner," and "Chief Physician." This setup allows researchers to realistically evaluate how well LLMs perform in clinical scenarios, which is crucial for understanding their potential and limitations in real-world medical settings.

The researchers developed a benchmark called the Multi-View Medical Evaluation (MVME) that uses high-quality Chinese medical records and the AI-controlled characters to assess LLMs' abilities in collecting symptoms, recommending examinations, and making diagnoses. Additionally, they proposed a dispute resolution mechanism that enables the models to engage in iterative discussions to improve the accuracy of their diagnoses.

Even with the improvements from this mechanism, current LLMs perform significantly worse in multi-turn interactions than in one-step approaches. This highlights the need for further research to bridge these gaps and enhance the clinical diagnostic capabilities of these AI models.

Technical Explanation

The researchers introduced a multi-agent framework called "AI Hospital" to simulate dynamic medical interactions between a "Doctor" player and non-player characters (NPCs) such as "Patient," "Examiner," and "Chief Physician." This setup allows for realistic assessments of large language models (LLMs) in clinical scenarios.
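The player-and-NPC setup described above can be pictured as a simple turn loop. The following is a minimal sketch, not the authors' implementation: the class names, the `act`/`reply` methods, and the scripted doctor policy are all hypothetical stand-ins, and a real system would call an LLM wherever the stubs return canned text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Patient:
    """Hypothetical NPC that reveals one symptom per question."""
    symptoms: List[str]
    cursor: int = 0

    def reply(self, question: str) -> str:
        if self.cursor < len(self.symptoms):
            answer = self.symptoms[self.cursor]
            self.cursor += 1
            return answer
        return "nothing else to report"


@dataclass
class Examiner:
    """Hypothetical NPC that looks up examination results."""
    results: Dict[str, str]

    def reply(self, request: str) -> str:
        return self.results.get(request, "no abnormal findings")


@dataclass
class Doctor:
    """Player stub with a scripted policy (assumption: an LLM would decide here)."""
    max_questions: int = 2
    notes: List[str] = field(default_factory=list)

    def act(self, turn: int) -> Tuple[str, str]:
        if turn < self.max_questions:
            return ("patient", "What symptoms do you have?")
        if turn == self.max_questions:
            return ("examiner", "blood test")
        return ("diagnosis", "; ".join(self.notes))


def run_consultation(doctor, patient, examiner, max_turns=10):
    """Alternate doctor actions and NPC replies until a diagnosis is given."""
    transcript = []
    for turn in range(max_turns):
        target, message = doctor.act(turn)
        if target == "diagnosis":
            transcript.append(("doctor", message))
            break
        npc = patient if target == "patient" else examiner
        answer = npc.reply(message)
        doctor.notes.append(answer)
        transcript.append((target, answer))
    return transcript
```

Calling `run_consultation(Doctor(), Patient(["fever", "cough"]), Examiner({"blood test": "elevated white cell count"}))` yields a four-entry transcript whose last entry is the doctor's summary of everything gathered. The point of the sketch is only that information arrives over several turns rather than in one prompt.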

To evaluate the LLMs' performance, the researchers developed the Multi-View Medical Evaluation (MVME) benchmark, which utilizes high-quality Chinese medical records and the AI-controlled NPCs. The benchmark assesses the LLMs' abilities in symptom collection, examination recommendations, and diagnoses.
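To make the idea of scoring several views separately concrete, here is a small sketch. It is an assumption-heavy illustration: the summary does not describe the benchmark's actual metrics, so a plain set-overlap recall against a reference record is swapped in, and the function names and dictionary keys are hypothetical.

```python
def recall(predicted, reference):
    """Fraction of reference items that appear in the prediction (case-insensitive)."""
    ref = {item.strip().lower() for item in reference}
    if not ref:
        return 0.0
    pred = {item.strip().lower() for item in predicted}
    return len(pred & ref) / len(ref)


def multi_view_scores(prediction, record):
    """Score each view independently; the keys are illustrative, not from the paper."""
    views = ("symptoms", "examinations", "diagnoses")
    return {
        view: recall(prediction.get(view, []), record.get(view, []))
        for view in views
    }
```

Keeping one score per view makes it visible when a model reaches a plausible diagnosis despite collecting few symptoms, or the reverse, which a single aggregate number would hide.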

Additionally, the researchers proposed a dispute resolution collaborative mechanism that aims to enhance diagnostic accuracy through iterative discussions. The broader motivation is the complexity of doctor-patient interactions, which has been a key limitation in the real-world clinical application of LLMs.
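An iterative discussion that converges on a diagnosis can be sketched as a loop that checks for agreement and asks participants to revise when they disagree. This is a hypothetical illustration only: the summary does not specify who takes part or how opinions are revised, so the majority-based rule and the `revise` callback below are assumptions.

```python
from collections import Counter


def resolve_disputes(initial_diagnoses, revise, max_rounds=3):
    """Iterate until all participants agree or the round budget is spent.

    revise(own, majority) returns a participant's updated diagnosis; in a real
    system this is where an LLM would weigh the other opinions (assumption).
    Returns (final_diagnosis, rounds_used).
    """
    diagnoses = list(initial_diagnoses)
    for round_index in range(max_rounds):
        if len(set(diagnoses)) == 1:
            return diagnoses[0], round_index
        majority, _ = Counter(diagnoses).most_common(1)[0]
        diagnoses = [revise(own, majority) for own in diagnoses]
    # No consensus within the budget: fall back to the most common answer.
    return Counter(diagnoses).most_common(1)[0][0], max_rounds


def adopt_majority(own, majority):
    """Simplest possible revision rule: always defer to the majority."""
    return majority
```

With `adopt_majority`, three opinions of which two agree converge after one round. A revision rule that never changes its answer exhausts `max_rounds` and falls back to the most common diagnosis, which shows why a round limit is needed.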

The results show that, despite these improvements, the LLMs still perform significantly worse in multi-turn interactions than in one-step approaches. This finding highlights the need for further research to bridge these gaps and improve the clinical diagnostic capabilities of LLMs.

Critical Analysis

The researchers acknowledge the limitations of their study, noting that the real-world clinical application of LLMs remains challenging due to the complexities of doctor-patient interactions. While the "AI Hospital" framework and the MVME benchmark provide a more realistic simulation environment, there are still concerns about the generalizability of these findings to actual clinical settings.

Furthermore, the study focuses on the Chinese medical ecosystem, and it is unclear how well the proposed methods and findings would translate to other healthcare systems and cultural contexts. Additional research is needed to assess the performance of LLMs in diverse medical settings and with different patient populations.

The researchers also highlight the need for further advancements in the collaborative dispute resolution mechanism, as the current implementation may not fully capture the nuances of human-to-human medical discussions. Exploring more sophisticated dialogue modeling and conflict resolution techniques could be a valuable avenue for future research.

Conclusion

The introduction of the "AI Hospital" framework and the MVME benchmark represents a significant step forward in the evaluation of large language models (LLMs) in clinical scenarios. By simulating dynamic medical interactions, researchers can now assess the performance of these AI models in a more realistic and comprehensive manner.

While the study demonstrates improvements in the LLMs' capabilities, it also highlights the persistent challenges in bridging the gap between their performance in one-step tasks and their ability to handle complex, multi-turn interactions. Addressing these limitations is crucial for the successful integration of LLMs in real-world clinical settings and for improving their overall diagnostic capabilities.

Continued research in this area has the potential to enhance the role of AI in healthcare, ultimately leading to more efficient and effective medical decision-making processes. The open-sourcing of the data, code, and experimental results from this study is a valuable contribution to the research community, encouraging further exploration and collaboration in this important field.

Related Papers

CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions

Mingyu Derek Ma, Chenchen Ye, Yu Yan, Xiaoxuan Wang, Peipei Ping, Timothy S Chang, Wei Wang

The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering and medication prescriptions. Supported by structured output ontologies, CliBench enables a precise and multi-granular evaluation, offering an in-depth understanding of LLMs' capabilities on diverse clinical tasks of desired granularity. We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making. Our preliminary results shed light on the potential and limitations of current LLMs in clinical settings, providing valuable insights for future advancements in LLM-powered healthcare.

6/17/2024

cs.CL cs.AI cs.LG

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, Michael Moor

Diagnosing and managing a patient is a complex, sequential decision making process that requires physicians to obtain information -- such as which tests to perform -- and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes overrely on static medical question-answering benchmarks, falling short on interactive decision-making that is required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate LLMs in their ability to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open medical agent benchmarks: a multimodal image and dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases both in patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in diagnostic accuracy of the doctor agents, as well as reduced compliance, confidence, and follow-up consultation willingness in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel in benchmarks like MedQA perform poorly in AgentClinic-MedQA. We find that the LLM used in the patient agent is an important factor for performance in the AgentClinic benchmark. We show that both too few and too many interactions reduce diagnostic accuracy in doctor agents. The code and data for this work are publicly available at https://AgentClinic.github.io.

6/3/2024

cs.HC cs.CL

Large Language Models in Healthcare: A Comprehensive Benchmark

Andrew Liu, Hongjian Zhou, Yining Hua, Omid Rohanian, Anshul Thakur, Lei Clifton, David A. Clifton

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long document) summarization, patient education, pharmacology QA and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.

6/27/2024

cs.CL cs.AI

Evaluating large language models in medical applications: a survey

Xiaolan Chen, Jiayang Xiang, Shanfu Lu, Yexin Liu, Mingguang He, Danli Shi

Large language models (LLMs) have emerged as powerful tools with transformative potential across numerous domains, including healthcare and medicine. In the medical domain, LLMs hold promise for tasks ranging from clinical decision support to patient education. However, evaluating the performance of LLMs in medical contexts presents unique challenges due to the complex and critical nature of medical information. This paper provides a comprehensive overview of the landscape of medical LLM evaluation, synthesizing insights from existing studies and highlighting evaluation data sources, task scenarios, and evaluation methods. Additionally, it identifies key challenges and opportunities in medical LLM evaluation, emphasizing the need for continued research and innovation to ensure the responsible integration of LLMs into clinical practice.

5/14/2024

cs.CL cs.AI