Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges

2309.04550

Published 6/12/2024 by Hiba Ahsan, Denis Jered McInerney, Jisoo Kim, Christopher Potter, Geoffrey Young, Silvio Amir, Byron C. Wallace

cs.CL

Abstract

Unstructured data in Electronic Health Records (EHRs) often contains critical information -- complementary to imaging -- that could inform radiologists' diagnoses. But the large volume of notes often associated with patients together with time constraints renders manually identifying relevant evidence practically infeasible. In this work we propose and evaluate a zero-shot strategy for using LLMs as a mechanism to efficiently retrieve and summarize unstructured evidence in patient EHR relevant to a given query. Our method entails tasking an LLM to infer whether a patient has, or is at risk of, a particular condition on the basis of associated notes; if so, we ask the model to summarize the supporting evidence. Under expert evaluation, we find that this LLM-based approach provides outputs consistently preferred to a pre-LLM information retrieval baseline. Manual evaluation is expensive, so we also propose and validate a method using an LLM to evaluate (other) LLM outputs for this task, allowing us to scale up evaluation. Our findings indicate the promise of LLMs as interfaces to EHR, but also highlight the outstanding challenge posed by hallucinations. In this setting, however, we show that model confidence in outputs strongly correlates with faithful summaries, offering a practical means to limit confabulations.

Overview

This paper explores using large language models (LLMs) to efficiently retrieve and summarize relevant information from unstructured Electronic Health Records (EHRs) to aid radiologists' diagnoses.
The key idea is to task an LLM with inferring whether a patient has or is at risk of a particular condition based on their EHR notes, and then ask the model to summarize the supporting evidence.
The researchers evaluate this approach against a pre-LLM information retrieval baseline and find that the LLM-based method produces outputs that are consistently preferred by experts.
They also propose and validate a method to use an LLM to evaluate other LLM outputs for this task, allowing them to scale up the evaluation process.

Plain English Explanation

Electronic Health Records (EHRs) often contain valuable information that could help radiologists make more informed diagnoses. However, the huge volume of notes associated with each patient, coupled with the time constraints radiologists face, makes it practically impossible for them to manually identify all the relevant evidence.

The researchers in this study propose using large language models (LLMs) as a way to efficiently retrieve and summarize the unstructured information in EHRs that could be relevant to a given medical query. The idea is to task the LLM with determining whether a patient has or is at risk of a particular condition, and then ask it to provide a summary of the evidence from the patient's notes that supports that assessment.

When the researchers had experts evaluate the outputs of this LLM-based approach, they found that the experts consistently preferred these outputs over the results of a traditional information retrieval system. This suggests that LLMs can be effective interfaces for accessing the wealth of information contained in EHRs.

Since manually evaluating all the LLM outputs would be extremely time-consuming, the researchers also developed a way to use an LLM to automatically evaluate other LLM outputs for this task. This allows them to scale up the evaluation process and get a better sense of the overall performance of their approach.

Overall, this research indicates the promise of using LLMs to help radiologists more easily access and make use of the valuable information hidden in EHR notes. However, it also highlights the challenge of dealing with LLM outputs that may include fabricated or inaccurate information, known as "hallucinations." The researchers show that in this particular setting, the LLM's own confidence in its outputs can be a useful signal for identifying faithful summaries and limiting confabulations.

Technical Explanation

The key innovation in this work is the use of a zero-shot strategy that tasks an LLM with inferring whether a patient has or is at risk of a particular condition based on their EHR notes, and then asking the model to summarize the supporting evidence.
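The two-step prompting flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `retrieve_evidence` function, and the `llm` callable (any function that maps a prompt string to a response string) are all hypothetical, and `stub_llm` is a stand-in so the example runs without a real model.

```python
from typing import Callable, Optional


def retrieve_evidence(
    llm: Callable[[str], str],
    note: str,
    condition: str,
) -> Optional[str]:
    """Two-step zero-shot flow: (1) yes/no inference, (2) evidence summary.

    `llm` is a hypothetical callable; the prompt wording is illustrative only.
    """
    # Step 1: ask whether the note suggests the patient has,
    # or is at risk of, the condition.
    decision_prompt = (
        f"Read the following clinical note.\n\n{note}\n\n"
        f"Does the patient have, or are they at risk of, {condition}? "
        "Answer 'yes' or 'no'."
    )
    answer = llm(decision_prompt).strip().lower()
    if not answer.startswith("yes"):
        return None  # negative answer: no summary is requested

    # Step 2: only after a positive answer, ask for the supporting evidence.
    summary_prompt = (
        f"Read the following clinical note.\n\n{note}\n\n"
        f"Summarize the evidence in the note that the patient has, "
        f"or is at risk of, {condition}."
    )
    return llm(summary_prompt).strip()


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model, used only so the sketch runs end to end."""
    if "Answer 'yes' or 'no'" in prompt:
        return "Yes" if "warfarin" in prompt.lower() else "No"
    return "Note mentions warfarin use."


positive = retrieve_evidence(
    stub_llm, "Patient on warfarin after AF diagnosis.", "intracranial hemorrhage"
)
negative = retrieve_evidence(
    stub_llm, "Routine follow-up, no medications.", "intracranial hemorrhage"
)
```

Gating the summary request on the yes/no answer means the model is only asked for evidence when it has already committed to a positive inference, which mirrors the "if so, summarize" structure described in the abstract.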

To evaluate this approach, the researchers compared it to a pre-LLM information retrieval baseline. They found that under expert evaluation, the LLM-based outputs were consistently preferred over the baseline.

Because manual evaluation is expensive, the researchers also proposed and validated a method that uses an LLM to evaluate LLM outputs for this task. This allowed them to scale up evaluation beyond what expert annotation alone would permit.
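One simple way to check an automatic evaluator of this kind is to compare its labels with expert labels on a shared subset of outputs. The sketch below is an assumption about how such a check might look, not the validation procedure used in the paper; the `agreement_rate` function, the label names, and the example values are hypothetical.

```python
from typing import Sequence


def agreement_rate(expert_labels: Sequence[str], llm_labels: Sequence[str]) -> float:
    """Fraction of items on which the LLM evaluator matches the expert label."""
    if len(expert_labels) != len(llm_labels):
        raise ValueError("label sequences must have the same length")
    if not expert_labels:
        raise ValueError("need at least one labelled item")
    matches = sum(e == m for e, m in zip(expert_labels, llm_labels))
    return matches / len(expert_labels)


# Hypothetical labels for five outputs (illustrative values, not data from the paper).
expert = ["useful", "useful", "not useful", "useful", "not useful"]
automatic = ["useful", "not useful", "not useful", "useful", "not useful"]
rate = agreement_rate(expert, automatic)
```

A high agreement rate on the expert-labelled subset would give some grounds for trusting the automatic evaluator on a larger unlabelled set. Chance-corrected statistics such as Cohen's kappa are a common alternative to raw agreement.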

The findings indicate the promise of LLMs as interfaces to EHRs, but also highlight the challenge of hallucinations, that is, outputs that contain fabricated or inaccurate information. However, the researchers show that in this setting the model's own confidence in its outputs strongly correlates with the faithfulness of the summaries, providing a practical way to limit confabulations.
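One way to act on this observation is to keep only summaries whose confidence score clears a threshold. The sketch below assumes that a confidence score between 0 and 1 is already available for each output; how that score is obtained is not specified here. The `ScoredSummary` type, the `filter_by_confidence` function, and the 0.8 threshold are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredSummary:
    text: str
    confidence: float  # assumed to lie in [0, 1]; higher means more confident


def filter_by_confidence(
    summaries: List[ScoredSummary], threshold: float = 0.8
) -> List[ScoredSummary]:
    """Keep only summaries whose confidence is at or above the threshold."""
    return [s for s in summaries if s.confidence >= threshold]


# Illustrative values only, not results from the paper.
candidates = [
    ScoredSummary("Note documents long-term anticoagulant use.", 0.95),
    ScoredSummary("Note reports a prior stroke.", 0.40),
    ScoredSummary("Note mentions a recent fall.", 0.80),
]
kept = filter_by_confidence(candidates)
```

Raising the threshold trades coverage for faithfulness: fewer summaries are shown, but those that remain are the ones the model was most confident about.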

Critical Analysis

The researchers acknowledge several limitations and areas for further research. First, they note that their evaluation was limited to a small set of conditions, and more comprehensive testing is needed to fully understand the capabilities and limitations of their approach.

Additionally, while they demonstrate that model confidence can be used to identify faithful summaries, this is not a perfect solution. There may still be cases where the model is overconfident in its incorrect outputs. Further research is needed to develop more robust methods for detecting and mitigating hallucinations in this context.

Another potential issue is the reliance on expert manual evaluation, which is time-consuming and expensive. While the researchers' proposed method for using an LLM to evaluate other LLM outputs is a step in the right direction, more work is needed to fully automate the evaluation process and make it scalable.

Finally, the researchers do not address the potential ethical and privacy concerns that may arise from using LLMs to access and summarize sensitive patient data. As these technologies become more widely adopted, it will be crucial to carefully consider the implications and develop appropriate safeguards.

Conclusion

This research demonstrates the potential of using large language models to efficiently extract and summarize relevant information from unstructured Electronic Health Records to support radiologists' diagnoses. Expert evaluators consistently preferred the outputs of the LLM-based approach over those of a pre-LLM information retrieval baseline, and the researchers developed an LLM-based method to scale up the evaluation process.

However, the study also highlights the challenge of hallucinations, that is, LLM outputs that contain fabricated or inaccurate information. While the researchers showed that model confidence can be a useful signal for identifying faithful summaries, more work is needed to fully address this issue.

Overall, this research represents an important step forward in leveraging the power of large language models to unlock the valuable information contained in unstructured medical data. As these technologies continue to evolve, it will be crucial to carefully consider the ethical and practical implications to ensure they are deployed in a responsible and beneficial manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A scoping review of using Large Language Models (LLMs) to investigate Electronic Health Records (EHRs)

Lingyao Li, Jiayan Zhou, Zhenxiang Gao, Wenyue Hua, Lizhou Fan, Huizi Yu, Loni Hagen, Yongfeng Zhang, Themistocles L. Assimes, Libby Hemphill, Siyuan Ma

Electronic Health Records (EHRs) play an important role in the healthcare system. However, their complexity and vast volume pose significant challenges to data interpretation and analysis. Recent advancements in Artificial Intelligence (AI), particularly the development of Large Language Models (LLMs), open up new opportunities for researchers in this domain. Although prior studies have demonstrated their potential in language understanding and processing in the context of EHRs, a comprehensive scoping review is lacking. This study aims to bridge this research gap by conducting a scoping review based on 329 related papers collected from OpenAlex. We first performed a bibliometric analysis to examine paper trends, model applications, and collaboration networks. Next, we manually reviewed and categorized each paper into one of the seven identified topics: named entity recognition, information extraction, text similarity, text summarization, text classification, dialogue system, and diagnosis and prediction. For each topic, we discussed the unique capabilities of LLMs, such as their ability to understand context, capture semantic relations, and generate human-like text. Finally, we highlighted several implications for researchers from the perspectives of data resources, prompt engineering, fine-tuning, performance measures, and ethical concerns. In conclusion, this study provides valuable insights into the potential of LLMs to transform EHR research and discusses their applications and ethical considerations.

5/24/2024

cs.ET

EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries

Sunjun Kweon, Jiyoun Kim, Heeyoung Kwak, Dongchul Cha, Hangyul Yoon, Kwanghyun Kim, Jeewon Yang, Seunghyun Won, Edward Choi

Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when dealing with accumulated summaries across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short in properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics, failing to reflect the real-world inquiries required by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries. We offer EHRNoteQA in two formats: open-ended and multi-choice question answering, and propose a reliable evaluation method for each. We evaluate 27 LLMs using EHRNoteQA and examine various factors affecting the model performance (e.g., the length and number of discharge summaries). Furthermore, to validate EHRNoteQA as a reliable proxy for expert evaluations in clinical practice, we measure the correlation between the LLM performance on EHRNoteQA, and the LLM performance manually evaluated by clinicians. Results show that LLM performance on EHRNoteQA has a higher correlation with clinician-evaluated performance (Spearman: 0.78, Kendall: 0.62) compared to other benchmarks, demonstrating its practical relevance in evaluating LLMs in clinical settings.

6/28/2024

cs.CL

Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization

Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerova, Nidhi Rohatgi, Poonam Hosamani, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, John Pauly, Akshay S. Chaudhari

Analyzing vast textual data and summarizing key information from electronic health records imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown promise in natural language processing (NLP), their effectiveness on a diverse range of clinical summarization tasks remains unproven. In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Quantitative assessments with syntactic, semantic, and conceptual NLP metrics reveal trade-offs between models and adaptation methods. A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are either equivalent (45%) or superior (36%) compared to summaries from medical experts. The ensuing safety analysis highlights challenges faced by both LLMs and medical experts, as we connect errors to potential medical harm and categorize types of fabricated information. Our research provides evidence of LLMs outperforming medical experts in clinical text summarization across multiple tasks. This suggests that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care.

4/15/2024

cs.CL

Zero-Shot Clinical Trial Patient Matching with LLMs

Michael Wornow, Alejandro Lozano, Dev Dash, Jenelle Jindal, Kenneth W. Mahaffey, Nigam H. Shah

Matching patients to clinical trials is a key unsolved challenge in bringing new drugs to market. Today, identifying patients who meet a trial's eligibility criteria is highly manual, taking up to 1 hour per patient. Automated screening is challenging, however, as it requires understanding unstructured clinical text. Large language models (LLMs) offer a promising solution. In this work, we explore their application to trial matching. First, we design an LLM-based system which, given a patient's medical history as unstructured clinical text, evaluates whether that patient meets a set of inclusion criteria (also specified as free text). Our zero-shot system achieves state-of-the-art scores on the n2c2 2018 cohort selection benchmark. Second, we improve the data and cost efficiency of our method by identifying a prompting strategy which matches patients an order of magnitude faster and more cheaply than the status quo, and develop a two-stage retrieval pipeline that reduces the number of tokens processed by up to a third while retaining high performance. Third, we evaluate the interpretability of our system by having clinicians evaluate the natural language justifications generated by the LLM for each eligibility decision, and show that it can output coherent explanations for 97% of its correct decisions and 75% of its incorrect ones. Our results establish the feasibility of using LLMs to accelerate clinical trial operations.

4/11/2024

cs.CL cs.AI