Answering real-world clinical questions using large language model based systems

2407.00541

Published 7/2/2024 by Yen Sia Low (Atropos Health, New York NY, USA), Michael L. Jackson (Atropos Health, New York NY, USA), Rebecca J. Hyde (Atropos Health, New York NY, USA), Robert E. Brown (Atropos Health and 92 others

cs.CL cs.AI cs.IR

💬

Abstract

Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.

Create account to get full access

Overview

The paper explores the potential of large language models (LLMs) to address the challenges of limited and difficult-to-contextualize healthcare research literature.
The researchers evaluated the ability of five LLM-based systems to answer 50 clinical questions, with nine physicians reviewing the responses for relevance, reliability, and actionability.
The results suggest that while general-purpose LLMs perform poorly, purpose-built systems combining retrieval-augmented generation (RAG) and novel evidence generation could significantly improve the availability of pertinent evidence for patient care.

Plain English Explanation

Healthcare providers often struggle to find the right information to guide their decisions for individual patients. This is because the existing research literature may not be directly relevant or trustworthy enough. Large language models (LLMs) could potentially help by either summarizing the published research or generating new studies based on real-world data.

The researchers in this study tested the ability of five different LLM-based systems to answer 50 clinical questions. They had nine experienced doctors review the responses to see how relevant, reliable, and useful they were for making healthcare decisions.

The general-purpose LLMs, like ChatGPT-4 and Claude 3 Opus, performed poorly, with only 2-10% of their answers being considered relevant and evidence-based. However, the systems that combined retrieval-augmented generation (RAG) and the ability to generate novel evidence, like ChatRWD, did much better, producing relevant and evidence-based answers for 24-58% of the questions.

Importantly, the ChatRWD system was even able to answer completely new questions, which the other LLMs could not do. This suggests that a combination of summarizing existing research and generating new evidence could greatly improve the availability of useful information for healthcare providers.

Technical Explanation

The researchers evaluated the performance of five LLM-based systems in answering 50 clinical questions:

ChatGPT-4, a general-purpose LLM
Claude 3 Opus, another general-purpose LLM
OpenEvidence, a retrieval-augmented generation (RAG) system
ChatRWD, an agentic LLM that can generate novel evidence
Gemini Pro 1.5, a general-purpose LLM

Nine independent physicians reviewed the responses from these systems and assessed them for relevance, reliability, and actionability. The results showed that the general-purpose LLMs (ChatGPT-4, Claude 3 Opus, and Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2-10%).

In contrast, the RAG-based and agentic LLM systems performed much better. OpenEvidence produced relevant and evidence-based answers for 24% of the questions, while ChatRWD was able to do so for 58% of the questions.

Importantly, only the ChatRWD system was able to answer novel questions that were not covered in the existing research literature, doing so for 65% of the questions. The other LLMs were limited to 0-9% in this regard.

Critical Analysis

The paper highlights the limitations of general-purpose LLMs in the healthcare domain and the potential benefits of more specialized systems that combine research summarization and novel evidence generation.

However, the study does not provide detailed information on the specific architectures, training data, or other technical details of the LLM-based systems tested. This makes it difficult to fully assess the generalizability of the findings or to understand the trade-offs and design choices that led to the observed performance differences.

Additionally, the study only evaluated the systems on a relatively small set of 50 clinical questions, which may not be representative of the full range of challenges healthcare providers face. Comprehensive surveys of LLMs in healthcare and medicine could provide a more holistic understanding of the current state of the field and the potential future directions.

Conclusion

This study suggests that while general-purpose LLMs are not yet ready to be used directly in healthcare decision-making, a combination of research summarization and novel evidence generation could significantly improve the availability of relevant and trustworthy information for patient care.

By developing purpose-built LLM systems that can effectively leverage real-world data and existing literature, healthcare providers may be able to access the evidence they need to make more informed decisions, ultimately leading to better patient outcomes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Large Language Models in Healthcare: A Comprehensive Benchmark

Andrew Liu, Hongjian Zhou, Yining Hua, Omid Rohanian, Anshul Thakur, Lei Clifton, David A. Clifton

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long document) summarization, patient education, pharmacology QA and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.

6/27/2024

cs.CL cs.AI

A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations

Jinqiang Wang, Huansheng Ning, Yi Peng, Qikai Wei, Daniel Tesfai, Wenwei Mao, Tao Zhu, Runhe Huang

Large Language Models (LLMs) have demonstrated surprising performance across various natural language processing tasks. Recently, medical LLMs enhanced with domain-specific knowledge have exhibited excellent capabilities in medical consultation and diagnosis. These models can smoothly simulate doctor-patient dialogues and provide professional medical advice. Most medical LLMs are developed through continued training of open-source general LLMs, which require significantly fewer computational resources than training LLMs from scratch. Additionally, this approach offers better protection of patient privacy compared to API-based solutions. This survey systematically explores how to train medical LLMs based on general LLMs. It covers: (a) how to acquire training corpus and construct customized medical training sets, (b) how to choose a appropriate training paradigm, (c) how to choose a suitable evaluation benchmark, and (d) existing challenges and promising future research directions are discussed. This survey can provide guidance for the development of LLMs focused on various medical applications, such as medical education, diagnostic planning, and clinical assistants.

6/18/2024

cs.CL cs.AI

💬

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria

The utilization of large language models (LLMs) in the Healthcare domain has generated both excitement and concern due to their ability to effectively respond to freetext queries with certain professional knowledge. This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications highlighting both the strengths and limitations. Secondly, we conduct a comparison between the previous PLMs and the latest LLMs, as well as comparing various LLMs with each other. Then we summarize related Healthcare training data, training methods, optimization strategies, and usage. Finally, the unique concerns associated with deploying LLMs in Healthcare settings are investigated, particularly regarding fairness, accountability, transparency and ethics. Our survey provide a comprehensive investigation from perspectives of both computer science and Healthcare specialty. Besides the discussion about Healthcare concerns, we supports the computer science community by compiling a collection of open source resources, such as accessible datasets, the latest methodologies, code implementations, and evaluation benchmarks in the Github. Summarily, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a shift from model-centered methodologies to data-centered methodologies. Also, we determine that the biggest obstacle of using LLMs in Healthcare are fairness, accountability, transparency and ethics.

6/12/2024

cs.CL

💬

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge

Hongjian Zhou, Fenglin Liu, Boyang Gu, Xinyu Zou, Jinfa Huang, Jinge Wu, Yiru Li, Sam S. Chen, Peilin Zhou, Junling Liu, Yining Hua, Chengfeng Mao, Chenyu You, Xian Wu, Yefeng Zheng, Lei Clifton, Zheng Li, Jiebo Luo, David A. Clifton

Large language models (LLMs), such as ChatGPT, have received substantial attention due to their capabilities for understanding and generating human language. While there has been a burgeoning trend in research focusing on the employment of LLMs in supporting different medical tasks (e.g., enhancing clinical diagnostics and providing medical education), a review of these efforts, particularly their development, practical applications, and outcomes in medicine, remains scarce. Therefore, this review aims to provide a detailed overview of the development and deployment of LLMs in medicine, including the challenges and opportunities they face. In terms of development, we provide a detailed introduction to the principles of existing medical LLMs, including their basic model structures, number of parameters, and sources and scales of data used for model development. It serves as a guide for practitioners in developing medical LLMs tailored to their specific needs. In terms of deployment, we offer a comparison of the performance of different LLMs across various medical tasks, and further compare them with state-of-the-art lightweight models, aiming to provide an understanding of the advantages and limitations of LLMs in medicine. Overall, in this review, we address the following questions: 1) What are the practices for developing medical LLMs 2) How to measure the medical task performance of LLMs in a medical setting? 3) How have medical LLMs been employed in real-world practice? 4) What challenges arise from the use of medical LLMs? and 5) How to more effectively develop and deploy medical LLMs? By answering these questions, this review aims to provide insights into the opportunities for LLMs in medicine and serve as a practical resource. We also maintain a regularly updated list of practical guides on medical LLMs at: https://github.com/AI-in-Health/MedLLMsPracticalGuide.

5/16/2024

cs.CL cs.AI