AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

2405.07960

Published 6/3/2024 by Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, Michael Moor

Abstract

Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information -- such as which tests to perform -- and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes overrely on static medical question-answering benchmarks, falling short on the interactive decision-making that is required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate LLMs in their ability to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open medical agent benchmarks: a multimodal image and dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases both in patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in diagnostic accuracy of the doctor agents, as well as reduced compliance, confidence, and follow-up consultation willingness in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel in benchmarks like MedQA perform poorly in AgentClinic-MedQA. We find that the LLM used in the patient agent is an important factor for performance in the AgentClinic benchmark. We show that both too few and too many interactions reduce diagnostic accuracy in doctor agents. The code and data for this work are publicly available at https://AgentClinic.github.io.

Overview

Introduces a new multimodal benchmark called AgentClinic to evaluate large language models (LLMs) in simulated clinical environments
Focuses on the ability of AI agents to uncover patient diagnoses through dialogue and active data collection
Embeds cognitive and implicit biases in patient and doctor agents to emulate realistic interactions
Finds that introducing bias leads to large reductions in the doctor agents' diagnostic accuracy, and to reduced compliance, confidence, and willingness to return for follow-up consultation in the patient agents
Evaluates a suite of state-of-the-art LLMs, revealing that some models performing well on static medical benchmarks struggle in the interactive AgentClinic setting

Plain English Explanation

Diagnosing and managing a patient's health is a complex process that requires doctors to gather information, such as which medical tests to perform, and then act on that information. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to have a significant impact on clinical care.

However, current methods for evaluating these AI systems often rely on static medical question-answering tests, which don't capture the interactive decision-making required in real-life clinical work. To address this, the researchers have developed a new benchmark called AgentClinic that simulates clinical environments where AI agents (or "doctors") must uncover a patient's diagnosis through dialogue and active data collection.

The researchers have embedded both cognitive and implicit biases into the patient and doctor agents to make the interactions more realistic. They find that introducing these biases leads to large reductions in the doctor agents' diagnostic accuracy, as well as reduced compliance, confidence, and willingness for follow-up consultation in the patient agents.

When evaluating a range of state-of-the-art LLMs, the researchers discover that several models that perform well on traditional medical benchmarks, such as MedQA, struggle in the more interactive AgentClinic setting. They also find that the specific LLM used in the patient agent is an important factor for the doctor agent's performance.

The researchers further show that both too few and too many interactions reduce the doctor agents' diagnostic accuracy. Overall, the AgentClinic benchmark evaluates AI systems on interactive clinical decision-making, which static question-answering tests do not measure, helping to establish whether these technologies can actually assist doctors and patients.

Technical Explanation

The researchers have developed a new multimodal benchmark called AgentClinic to evaluate the ability of LLMs to operate as agents in simulated clinical environments. In this benchmark, the doctor agent must uncover the patient's diagnosis through a dialogue-driven, active data collection process.
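The interaction pattern described above can be sketched in a few lines. This is a minimal illustration, not the AgentClinic implementation: the `Case` fields, the `TEST:` and `DIAGNOSIS:` action prefixes, the keyword-matching patient, and the scripted doctor are all assumptions made for the example, and a real run would replace the scripted doctor and patient with LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (doctor action, observation) pairs


@dataclass
class Case:
    """Hypothetical case record; field names are assumptions for this sketch."""
    symptoms: Dict[str, str]      # keyword in a question -> patient answer
    test_results: Dict[str, str]  # test name -> result
    diagnosis: str                # ground-truth diagnosis, hidden from the doctor


def patient_reply(case: Case, question: str) -> str:
    """Stand-in for a patient agent: answer by simple keyword matching."""
    lowered = question.lower()
    for keyword, answer in case.symptoms.items():
        if keyword in lowered:
            return answer
    return "I'm not sure."


def run_test(case: Case, test_name: str) -> str:
    """Stand-in for a measurement step: look up the requested test."""
    return case.test_results.get(test_name.lower(), "normal")


def run_episode(case: Case, doctor: Callable[[History], str],
                max_turns: int = 20) -> Tuple[bool, History]:
    """Let the doctor ask questions or order tests until it commits to a diagnosis."""
    history: History = []
    for _ in range(max_turns):
        action = doctor(history)
        if action.startswith("DIAGNOSIS:"):
            guess = action[len("DIAGNOSIS:"):].strip()
            return guess.lower() == case.diagnosis.lower(), history
        if action.startswith("TEST:"):
            observation = run_test(case, action[len("TEST:"):].strip())
        else:
            observation = patient_reply(case, action)
        history.append((action, observation))
    return False, history  # turn budget exhausted without a diagnosis


def scripted_doctor(history: History) -> str:
    """Toy doctor that follows a fixed script instead of calling an LLM."""
    script = ["Do you have a fever?", "TEST: chest x-ray", "DIAGNOSIS: pneumonia"]
    return script[min(len(history), len(script) - 1)]


case = Case(
    symptoms={"fever": "Yes, for three days."},
    test_results={"chest x-ray": "right lower lobe consolidation"},
    diagnosis="Pneumonia",
)
correct, history = run_episode(case, scripted_doctor)
```

The key design point is that the ground-truth diagnosis lives only in the environment: the doctor sees nothing but the history of its own actions and the observations they produced, so it has to decide what to ask next rather than answer a fully specified question.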

The researchers have created two open benchmarks within AgentClinic: AgentClinic-NEJM, a multimodal image and dialogue environment, and AgentClinic-MedQA, a dialogue-only environment.

To emulate realistic interactions, the researchers have embedded both cognitive and implicit biases in the patient and doctor agents. Their experiments reveal that introducing these biases leads to significant reductions in the doctor agents' diagnostic accuracy, as well as reduced compliance, confidence, and follow-up consultation willingness in the patient agents.
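One plausible way to embed a bias in an LLM agent is to append a bias instruction to the agent's system prompt. The sketch below assumes that approach; the bias names and the instruction wording are illustrative placeholders written for this example, not the prompts used in the paper.

```python
from typing import Dict, Optional

# Hypothetical bias instructions; the wording is an assumption for illustration.
BIAS_INSTRUCTIONS: Dict[str, str] = {
    "recency": (
        "You recently saw a similar case, and that experience "
        "strongly colors your judgment."
    ),
    "confirmation": (
        "Once you form an initial impression, you favor information "
        "that supports it."
    ),
}


def build_system_prompt(role: str, bias: Optional[str] = None) -> str:
    """Return a system prompt for the given role, optionally with a bias appended."""
    base = f"You are a {role} in a simulated clinical encounter."
    if bias is None:
        return base
    if bias not in BIAS_INSTRUCTIONS:
        raise ValueError(f"unknown bias: {bias}")
    return f"{base} {BIAS_INSTRUCTIONS[bias]}"


unbiased_prompt = build_system_prompt("doctor")
biased_prompt = build_system_prompt("doctor", bias="confirmation")
```

Keeping the biased and unbiased prompts identical apart from the appended instruction makes it possible to attribute a change in diagnostic accuracy to the bias itself when the two conditions are run on the same cases.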

Evaluating a suite of state-of-the-art LLMs, the researchers find that several models that excel in benchmarks like MedQA perform poorly in the AgentClinic-MedQA setting. They also discover that the LLM used in the patient agent is an important factor for the doctor agent's performance in the AgentClinic benchmark.

Additionally, the researchers show that both too few and too many interactions reduce the doctor agents' diagnostic accuracy, which suggests that the number of turns allowed in an encounter is itself a parameter that affects measured performance.
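A sweep over the turn budget is one way to measure this effect. The harness below is a sketch under stated assumptions: `accuracy_by_budget` and `toy_episode` are hypothetical names, and the toy episode only models the "too few turns" side (a case is solved once the budget reaches the number of turns it needs). It cannot reproduce the drop at high turn counts, which would require real LLM agents. The numbers are toy values, not results from the paper.

```python
from typing import Callable, Dict, Sequence, TypeVar

CaseT = TypeVar("CaseT")


def accuracy_by_budget(
    cases: Sequence[CaseT],
    episode_fn: Callable[[CaseT, int], bool],
    budgets: Sequence[int],
) -> Dict[int, float]:
    """Run every case at every turn budget and return accuracy per budget."""
    results: Dict[int, float] = {}
    for budget in budgets:
        outcomes = [episode_fn(case, budget) for case in cases]
        results[budget] = sum(outcomes) / len(outcomes)
    return results


def toy_episode(turns_needed: int, max_turns: int) -> bool:
    """Toy stand-in for a full episode: solved only if the budget is large enough."""
    return max_turns >= turns_needed


# Each toy case is just the number of turns needed to reach its diagnosis.
toy_cases = [2, 5, 8, 12]
curve = accuracy_by_budget(toy_cases, toy_episode, budgets=[1, 5, 10, 20])
```

In a real evaluation, `episode_fn` would wrap a full doctor-patient dialogue, and the resulting curve would show where accuracy peaks for a given model.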

Critical Analysis

The researchers have developed a valuable benchmark for evaluating the abilities of LLMs in clinical decision-making, which is an important step forward in ensuring that these AI systems can effectively assist doctors and patients.

One potential limitation of the study is the use of simulated environments and artificial biases, which may not fully capture the complexity and nuance of real-world clinical interactions. Further research is needed to validate the findings in actual clinical settings.

Additionally, the paper does not explore the potential reasons why some high-performing LLMs on static medical benchmarks struggle in the more interactive AgentClinic setting. Investigating the specific capabilities and limitations of these models could provide important insights for improving their performance in interactive clinical scenarios.

The researchers also acknowledge that the AgentClinic benchmark is still a simplified representation of the clinical decision-making process, and there are many other factors, such as patient preferences, ethical considerations, and legal constraints, that would need to be incorporated for a more comprehensive evaluation.

Overall, the AgentClinic benchmark represents a significant advancement in the field of AI-assisted clinical decision-making, and the insights from this study can help guide the development of more robust and reliable AI systems for healthcare.

Conclusion

The AgentClinic benchmark introduced in this paper represents an important step forward in the evaluation of LLMs for clinical decision-making. By simulating interactive clinical environments and incorporating biases, the researchers have created a more realistic and comprehensive assessment of these AI systems' abilities to uncover patient diagnoses through dialogue and active data collection.

The findings reveal that several state-of-the-art LLMs that excel on traditional medical benchmarks such as MedQA struggle in the more interactive AgentClinic setting, highlighting the need for evaluation approaches that capture the sequential, interactive nature of real-world clinical practice. The researchers' insights into the impact of biases and the number of interactions on diagnostic accuracy provide guidance for the development of AI-powered clinical decision support tools.

As the field of healthcare AI continues to evolve, the AgentClinic benchmark and the lessons learned from this study will be crucial in ensuring that these technologies are carefully and rigorously evaluated before being deployed in clinical settings, ultimately leading to safer and more effective patient care.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xi, Fei Huang, Jingren Zhou

Artificial intelligence has significantly advanced healthcare, particularly through large language models (LLMs) that excel in medical question answering benchmarks. However, their real-world clinical application remains limited due to the complexities of doctor-patient interactions. To address this, we introduce AI Hospital, a multi-agent framework simulating dynamic medical interactions between Doctor as player and NPCs including Patient, Examiner, and Chief Physician. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and NPCs to evaluate LLMs' performance in symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions. Despite improvements, current LLMs exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. Our findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. Our data, code, and experimental results are all open-sourced at https://github.com/LibertFan/AI_Hospital.

7/1/2024

cs.CL

CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions

Mingyu Derek Ma, Chenchen Ye, Yu Yan, Xiaoxuan Wang, Peipei Ping, Timothy S Chang, Wei Wang

The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering and medication prescriptions. Supported by structured output ontologies, CliBench enables a precise and multi-granular evaluation, offering an in-depth understanding of LLM's capability on diverse clinical tasks of desired granularity. We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making. Our preliminary results shed light on the potential and limitations of current LLMs in clinical settings, providing valuable insights for future advancements in LLM-powered healthcare.

6/17/2024

cs.CL cs.AI cs.LG

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

Weixiang Yan, Haitian Liu, Tengxiao Wu, Qian Chen, Wen Wang, Haoyuan Chai, Jiayi Wang, Weishan Zhao, Yixin Zhang, Renjun Zhang, Li Zhu

LLMs have achieved significant performance progress in various NLP applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. Firstly, most existing medical evaluation benchmarks face the risk of data leakage or contamination. Secondly, existing benchmarks often neglect the characteristics of multiple departments and specializations in modern medical practice. Thirdly, existing evaluation methods are limited to multiple-choice questions, which do not align with the real-world diagnostic scenarios. Lastly, existing evaluation methods lack comprehensive evaluations of end-to-end real clinical scenarios. These limitations in benchmarks in turn obstruct advancements of LLMs and agents for medicine. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.

6/21/2024

cs.CL cs.AI

Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology

Dyke Ferber, Omar S. M. El Nahhas, Georg Wolflein, Isabella C. Wiest, Jan Clusmann, Marie-Elisabeth Leßman, Sebastian Foersch, Jacqueline Lammert, Maximilian Tschochohei, Dirk Jager, Manuel Salto-Tellez, Nikolaus Schultz, Daniel Truhn, Jakob Nikolas Kather

Multimodal artificial intelligence (AI) systems have the potential to enhance clinical decision-making by interpreting various types of medical data. However, the effectiveness of these models across all medical fields is uncertain. Each discipline presents unique challenges that need to be addressed for optimal performance. This complexity is further increased when attempting to integrate different fields into a single model. Here, we introduce an alternative approach to multimodal medical AI that utilizes the generalist capabilities of a large language model (LLM) as a central reasoning engine. This engine autonomously coordinates and deploys a set of specialized medical AI tools. These tools include text, radiology and histopathology image interpretation, genomic data processing, web searches, and document retrieval from medical guidelines. We validate our system across a series of clinical oncology scenarios that closely resemble typical patient care workflows. We show that the system has a high capability in employing appropriate tools (97%), drawing correct conclusions (93.6%), and providing complete (94%) and helpful (89.2%) recommendations for individual patient cases while consistently referencing relevant literature (82.5%) upon instruction. This work provides evidence that LLMs can effectively plan and execute domain-specific models to retrieve or synthesize new information when used as autonomous agents. This enables them to function as specialist, patient-tailored clinical assistants. It also simplifies regulatory compliance by allowing each component tool to be individually validated and approved. We believe that our work can serve as a proof-of-concept for more advanced LLM-agents in the medical domain.

4/9/2024

cs.AI