Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models

Read original: arXiv:2401.02851 - Published 8/23/2024 by Akhil Vaid, Joshua Lampert, Juhee Lee, Ashwin Sawant, Donald Apakama, Ankit Sakhuja, Ali Soroush, Sarah Bick, Ethan Abbott, Hernando Gomez and 20 others


Overview

Large language models (LLMs) show promise in healthcare, passing medical exams and providing clinical knowledge.
However, using LLMs as information retrieval tools faces challenges like outdated data, resource demands, and occasional incorrect information generation.
This study assessed LLMs' potential as autonomous agents in a simulated medical center, using real-world clinical cases across specialties.
Both proprietary and open-source LLMs were evaluated, with Retrieval Augmented Generation (RAG) enhancing contextual relevance.

Plain English Explanation

Large language models (LLMs) are a type of artificial intelligence that can generate human-like text. These models have shown they can perform tasks like passing medical licensing exams and providing clinical information, indicating they could be helpful in healthcare.

However, there are some challenges with using LLMs as search tools to find medical information. The data they are trained on can become outdated quickly, they require a lot of computing power to run, and they can sometimes generate incorrect information.

This study looked at whether LLMs could act as autonomous agents, making their own decisions, in a simulated medical center setting. The researchers tested both proprietary (private) and open-source LLMs, including an approach called Retrieval Augmented Generation (RAG) that helps the models provide more relevant information.

The proprietary models, especially the GPT-4 model, generally performed better than the open-source models. They were more likely to follow medical guidelines and provide more accurate responses, especially when using RAG. But the researchers needed expert clinicians to manually check the models' outputs to make sure they were reliable.

The study presents natural language programming as the appropriate paradigm for adjusting the behavior of these LLMs, allowing precise changes through tailored prompts and real-world interactions. This suggests LLMs could significantly enhance clinical decision-making, provided experts stay involved to ensure the models work reliably and effectively in healthcare settings.

Technical Explanation

The researchers evaluated the potential of generative large language models (LLMs) to function as autonomous agents in a simulated tertiary care medical center, using real-world clinical cases across multiple specialties. Both proprietary and open-source LLMs were assessed, with Retrieval Augmented Generation (RAG) used to enhance the contextual relevance of the models' outputs.
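To make the idea of Retrieval Augmented Generation concrete, here is a minimal sketch of the general pattern: retrieve the reference text most relevant to a case, then place it in the prompt ahead of the question. It is not the pipeline used in the study. The guideline snippets, the function names (`retrieve`, `build_prompt`), and the word-overlap scoring are all assumptions made for illustration; production systems typically use embedding-based search instead of word overlap.

```python
import re
from collections import Counter

# Hypothetical guideline snippets: illustrative placeholders, not taken from the study.
GUIDELINES = [
    "Sepsis: obtain blood cultures before starting antibiotics.",
    "Chest pain: obtain a 12-lead ECG within 10 minutes of arrival.",
    "Stroke: perform a non-contrast head CT before thrombolysis.",
]


def tokenize(text):
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=1):
    """Return up to k documents that share the most word occurrences with the query."""
    query_tokens = tokenize(query)
    scored = [(sum((query_tokens & tokenize(doc)).values()), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents with no overlap so irrelevant text never reaches the prompt.
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(case, documents, k=1):
    """Assemble a prompt that puts the retrieved guidance ahead of the case description."""
    context = "\n".join(retrieve(case, documents, k)) or "(no guideline found)"
    return f"Relevant guidance:\n{context}\n\nCase:\n{case}\n\nNext step:"


if __name__ == "__main__":
    print(build_prompt("Patient with chest pain and sweating", GUIDELINES))
```

The resulting string would then be sent to whichever model is being evaluated. The point of the pattern is that the model answers with current reference text in front of it, without relying only on what it absorbed during training.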

The proprietary models, particularly GPT-4, generally outperformed the open-source models in terms of improved guideline adherence and more accurate responses, especially when leveraging the RAG technique. The manual evaluation by expert clinicians was crucial in validating the models' outputs, underscoring the importance of human oversight in LLM operation within healthcare settings.

Furthermore, the study emphasizes natural language programming (abbreviated NLP in the paper, not to be confused with natural language processing) as the appropriate paradigm for modifying model behavior, allowing for precise adjustments through tailored prompts and real-world interactions. The findings point to the potential of LLMs to significantly enhance and supplement clinical decision-making, and to the value of continuous expert involvement, combined with the flexibility of NLP, in keeping the models reliable and effective in healthcare.
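As a rough illustration of what adjusting behavior through tailored prompts can look like in practice, the sketch below treats an agent's instructions as plain-language rules that can be revised without touching model weights. The `AgentInstructions` class, its methods, and the example rules are hypothetical and are not drawn from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class AgentInstructions:
    """Plain-language instructions that steer a (hypothetical) LLM agent."""
    role: str
    rules: list = field(default_factory=list)

    def add_rule(self, rule):
        """Return a new instruction set with one more rule, leaving the original unchanged."""
        return AgentInstructions(self.role, self.rules + [rule])

    def render(self):
        """Render the role and numbered rules as the text sent to the model."""
        numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(self.rules, 1))
        return f"{self.role}\n\nRules:\n{numbered}"


# Illustrative starting instructions.
base = AgentInstructions(
    role="You are an assistant supporting clinicians in a simulated hospital.",
    rules=["Request missing information before recommending an action."],
)

# After expert review, behavior is adjusted by adding a sentence, not by retraining.
revised = base.add_rule("Cite the guideline passage that supports each recommendation.")

if __name__ == "__main__":
    print(revised.render())
```

Keeping each revision as a separate object makes it straightforward to compare how an agent behaves under the old and new instructions, which fits the expert-in-the-loop review the study describes.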

Critical Analysis

The study provides valuable insights into the potential and limitations of using large language models (LLMs) as autonomous agents in a healthcare setting. The researchers' focus on real-world clinical cases and the involvement of expert clinicians to validate the models' outputs is a strength, as it helps address the challenge of LLMs generating incorrect information.

However, the study does not examine some specific challenges of using LLMs in healthcare, such as the potential for bias in the underlying data or the difficulty of ensuring patient privacy and data security. It also does not explore the long-term implications of relying on LLMs for clinical decision-making, such as the impact on the medical profession or the potential for over-reliance on the technology.

Further research is needed to fully understand the limitations and potential risks of using LLMs in healthcare, as well as to explore alternative approaches or combinations of technologies that could enhance the reliability and effectiveness of these models in a clinical setting. Comprehensive surveys of large language models in medicine and progress in their application may provide additional insights and context for this study.

Conclusion

This study highlights the potential of large language models (LLMs) to enhance and supplement clinical decision-making in healthcare settings, but also underscores the importance of human oversight and the need for reliable, accurate information generation.

The researchers' findings suggest that proprietary LLMs, especially GPT-4, can outperform open-source models in terms of guideline adherence and response accuracy, particularly when using Retrieval Augmented Generation (RAG) techniques. However, the manual evaluation by expert clinicians remains crucial in ensuring the models' outputs are valid and reliable.

The study emphasizes natural language programming as the appropriate approach for modifying LLM behavior, allowing for precise adjustments through tailored prompts and real-world interactions. This flexibility could be key to unlocking the full potential of LLMs in healthcare, while also maintaining the necessary oversight and safeguards to protect patient care and outcomes.

As large language models continue to advance and find new applications, this research highlights the importance of carefully evaluating their capabilities and limitations, especially in high-stakes domains like healthcare. Ongoing collaboration between AI researchers and medical experts will be essential to ensuring these powerful tools are leveraged responsibly and effectively to improve patient outcomes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models

Akhil Vaid, Joshua Lampert, Juhee Lee, Ashwin Sawant, Donald Apakama, Ankit Sakhuja, Ali Soroush, Sarah Bick, Ethan Abbott, Hernando Gomez, Michael Hadley, Denise Lee, Isotta Landi, Son Q Duong, Nicole Bussola, Ismail Nabeel, Silke Muehlstedt, Silke Muehlstedt, Robert Freeman, Patricia Kovatch, Brendan Carr, Fei Wang, Benjamin Glicksberg, Edgar Argulian, Stamatios Lerakis, Rohan Khera, David L. Reich, Monica Kraft, Alexander Charney, Girish Nadkarni

Generative Large Language Models (LLMs) hold significant promise in healthcare, demonstrating capabilities such as passing medical licensing exams and providing clinical knowledge. However, their current use as information retrieval tools is limited by challenges like data staleness, resource demands, and occasional generation of incorrect information. This study assessed the potential of LLMs to function as autonomous agents in a simulated tertiary care medical center, using real-world clinical cases across multiple specialties. Both proprietary and open-source LLMs were evaluated, with Retrieval Augmented Generation (RAG) enhancing contextual relevance. Proprietary models, particularly GPT-4, generally outperformed open-source models, showing improved guideline adherence and more accurate responses with RAG. The manual evaluation by expert clinicians was crucial in validating models' outputs, underscoring the importance of human oversight in LLM operation. Further, the study emphasizes Natural Language Programming (NLP) as the appropriate paradigm for modifying model behavior, allowing for precise adjustments through tailored prompts and real-world interactions. This approach highlights the potential of LLMs to significantly enhance and supplement clinical decision-making, while also emphasizing the value of continuous expert involvement and the flexibility of NLP to ensure their reliability and effectiveness in healthcare settings.

8/23/2024


Answering real-world clinical questions using large language model based systems

Yen Sia Low (Atropos Health, New York NY, USA), Michael L. Jackson (Atropos Health, New York NY, USA), Rebecca J. Hyde (Atropos Health, New York NY, USA), Robert E. Brown (Atropos Health, New York NY, USA), Neil M. Sanghavi (Atropos Health, New York NY, USA), Julian D. Baldwin (Atropos Health, New York NY, USA), C. William Pike (Atropos Health, New York NY, USA), Jananee Muralidharan (Atropos Health, New York NY, USA), Gavin Hui (Atropos Health, New York NY, USA, Department of Medicine, University of California, Los Angeles CA, USA), Natasha Alexander (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Hadeel Hassan (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Rahul V. Nene (Department of Emergency Medicine, University of California, San Diego CA, USA), Morgan Pike (Department of Emergency Medicine, University of Michigan, Ann Arbor MI, USA), Courtney J. Pokrzywa (Department of Surgery, Columbia University, New York NY, USA), Shivam Vedak (Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA), Adam Paul Yan (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Dong-han Yao (Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA), Amy R. Zipursky (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Christina Dinh (Atropos Health, New York NY, USA), Philip Ballentine (Atropos Health, New York NY, USA), Dan C. Derieg (Atropos Health, New York NY, USA), Vladimir Polony (Atropos Health, New York NY, USA), Rehan N. Chawdry (Atropos Health, New York NY, USA), Jordan Davies (Atropos Health, New York NY, USA), Brigham B. Hyde (Atropos Health, New York NY, USA), Nigam H. Shah (Atropos Health, New York NY, USA, Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA), Saurabh Gombar (Atropos Health, New York NY, USA, Department of Pathology, Stanford University, Stanford CA, USA)

Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.

7/2/2024

Beyond Generative Artificial Intelligence: Roadmap for Natural Language Generation

María Miró Maestre, Iván Martínez-Murillo, Tania J. Martin, Borja Navarro-Colorado, Antonio Ferrández, Armando Suárez Cueto, Elena Lloret

Generative Artificial Intelligence has grown exponentially as a result of Large Language Models (LLMs). This has been possible because of the impressive performance of deep learning methods created within the field of Natural Language Processing (NLP) and its subfield Natural Language Generation (NLG), which is the focus of this paper. The growing LLM family includes the popular GPT-4 and Bard; more specifically, tools such as ChatGPT have become a benchmark for other LLMs when solving most of the tasks involved in NLG research. This scenario poses new questions about the next steps for NLG and how the field can adapt and evolve to deal with new challenges in the era of LLMs. To address this, the present paper conducts a review of a representative sample of surveys recently published in NLG. By doing so, we aim to provide the scientific community with a research roadmap to identify which NLG aspects are still not suitably addressed by LLMs, as well as suggest future lines of research that should be addressed going forward.

7/16/2024


Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation

Jack Krolik, Herprit Mahal, Feroz Ahmad, Gaurav Trivedi, Bahador Saket

This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q&A) systems, a crucial form of Natural Language Processing. Traditionally, human evaluation has been indispensable for assessing the quality of these responses. However, manual evaluation by medical professionals is time-consuming and costly. Our study examines whether LLMs can reliably replicate human evaluations by using questions derived from patient data, thereby saving valuable time for medical experts. While the findings suggest promising results, further research is needed to address more specific or complex questions that were beyond the scope of this initial investigation.

9/4/2024