Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports

Read original: arXiv:2409.10576 - Published 9/19/2024 by Mohamed Sobhi Jabal, Pranav Warman, Jikai Zhang, Kartikeye Gupta, Ayush Jain, Maciej Mazurowski, Walter Wiggins, Kirti Magudia, Evan Calabrese

Overview

The paper aims to develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports.
The system uses open-weights large language models (LMs) and retrieval-augmented generation (RAG).
The study assesses the impact of various model configuration variables on the extraction performance.

Plain English Explanation

The paper describes a system that can automatically extract important clinical information from doctors' notes and reports. These notes are often written in a free-form, unstructured way, making it difficult for computers to understand and extract the key details.

The researchers developed a pipeline that uses powerful language models - large AI systems trained on massive amounts of text data - to analyze the reports and pull out specific pieces of information. They tested the system on two tasks: extracting Brain Tumor Reporting and Data System (BT-RADS) scores from radiology reports, and determining isocitrate dehydrogenase (IDH) mutation status from pathology reports.

They found that the best-performing models could achieve over 98% accuracy in extracting the brain tumor scores and over 90% accuracy for the genetic mutation status. The key factors that improved performance were using larger, more recently developed language models that had been fine-tuned on medical data, as well as carefully crafting the prompts (instructions) given to the models.

This shows that these AI language models have significant potential to automate the extraction of structured medical data from unstructured reports. This could save clinicians time and effort, and make it easier to gather and analyze large datasets for research. However, the researchers note that careful model selection, prompt engineering, and semi-automated optimization are critical for getting the best results.

Technical Explanation

The paper presents an automated pipeline for extracting structured clinical information from unstructured radiology and pathology reports using large language models (LMs) and retrieval-augmented generation (RAG) techniques.

The researchers tested the system on two clinical datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores, and 2,154 pathology reports annotated for isocitrate dehydrogenase (IDH) mutation status. They systematically evaluated the impact of various model configuration variables, including model size, quantization, prompting strategies, output formatting, and inference parameters.
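A systematic evaluation of this kind can be pictured as a grid over the configuration variables. The sketch below is only an illustration: the variable names follow the list above, but every value in `GRID` is a hypothetical placeholder, since the summary does not give the grid the authors actually used.

```python
from itertools import product

# Hypothetical values for each configuration variable named in the text.
# These are placeholders, not the settings used in the paper.
GRID = {
    "model": ["small-model", "large-model"],
    "quantization": ["4bit", "8bit", "fp16"],
    "prompt_style": ["zero_shot", "few_shot"],
    "output_format": ["json", "free_text"],
    "temperature": [0.0, 0.7],
}


def iter_configs(grid):
    """Yield one dict per combination of configuration values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(GRID))
print(len(configs))  # 2 * 3 * 2 * 2 * 2 = 48 combinations
```

Each resulting dict would then be passed to whatever code runs the model on the annotated reports, so that every combination is scored on the same data.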

The best-performing models achieved over 98% accuracy in extracting BT-RADS scores from radiology reports and over 90% accuracy for IDH mutation status extraction from pathology reports. The top-performing model was a medical fine-tuned version of the Llama3 language model. Larger, newer, and domain-specific fine-tuned models consistently outperformed older and smaller models.
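Accuracy here means the fraction of reports for which the extracted value matches the human annotation. A minimal sketch of that calculation follows; the function name and the example labels are invented for illustration and are not taken from the study's data.

```python
def accuracy(predicted, annotated):
    """Return the fraction of predictions that exactly match the annotations."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not annotated:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == a for p, a in zip(predicted, annotated))
    return correct / len(annotated)


# Made-up labels for four reports; three of the four predictions match.
annotated = ["2", "3a", "1b", "4"]
predicted = ["2", "3a", "3b", "4"]
print(accuracy(predicted, annotated))  # 0.75
```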

Model quantization (storing model weights at lower numerical precision to reduce memory use and, typically, inference cost) had minimal impact on performance. Few-shot prompting, in which a handful of worked examples are included in the prompt, significantly improved accuracy. RAG improved performance for the complex pathology reports but not for the shorter radiology reports.
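To make the two prompting ideas concrete, the sketch below combines a retrieval step with a few-shot prompt. It rests on loud assumptions: the summary does not describe the authors' retriever or prompt templates, so a simple word-overlap score stands in for whatever retrieval method they used, and all function names, report text, and example answers are invented.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(chunks, query, top_k=1):
    """Rank report chunks by word overlap with the query (a stand-in retriever)."""
    query_tokens = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(tokenize(c) & query_tokens), reverse=True)
    return ranked[:top_k]


def build_prompt(question, examples, context_chunks):
    """Assemble a few-shot prompt: worked examples first, then the new report."""
    lines = ["Extract the requested value from the report excerpt.", ""]
    for excerpt, answer in examples:
        lines += [f"Report: {excerpt}", f"Question: {question}", f"Answer: {answer}", ""]
    lines += ["Report: " + " ".join(context_chunks), f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Invented pathology-style text, split into chunks by hand for illustration.
chunks = [
    "Sections show a diffusely infiltrating glioma.",
    "Immunohistochemistry: IDH1 R132H mutation positive.",
    "Specimen received in formalin.",
]
# Invented worked examples for the few-shot part of the prompt.
examples = [
    ("Sequencing shows no IDH1 or IDH2 mutation.", "wild-type"),
    ("IDH1 mutation detected by sequencing.", "mutant"),
]

question = "IDH mutation status"
context = retrieve(chunks, question, top_k=1)
prompt = build_prompt(question, examples, context)
print(prompt)
```

For a short report the whole text fits in the prompt and the retrieval step adds little, which is consistent with the finding that RAG helped only on the longer pathology reports.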

Critical Analysis

The researchers acknowledge several limitations and areas for further research:

The study was limited to two specific clinical tasks (BT-RADS extraction and IDH mutation status), and the findings may not generalize to other types of clinical information extraction.
The performance of the models was evaluated on pre-annotated datasets, which may not reflect real-world challenges such as ambiguous or incomplete reports.
The study did not explore the impact of combining multiple LMs or ensembling different approaches, which could potentially further improve performance.
The researchers note that the semi-automated optimization process required for optimal performance may limit the scalability and accessibility of the system.

Additionally, while the results are promising, the researchers emphasize that careful model selection, prompt engineering, and semi-automated optimization are critical for achieving the best performance. This suggests that the deployment of such systems in real-world clinical settings may require significant effort and expertise.

Conclusion

The paper demonstrates the significant potential of open-weights large language models and retrieval-augmented generation for automated extraction of structured clinical data from unstructured medical reports, including locally run, privacy-preserving deployments. The best-performing models achieved over 98% accuracy for BT-RADS score extraction and over 90% accuracy for IDH mutation status, suggesting these approaches could be reliable enough for practical use in research workflows.

However, the researchers caution that careful model selection, prompt engineering, and semi-automated optimization using annotated data are essential for optimal performance. These findings point to the potential of human-machine collaboration, with clinicians and data scientists working together, when applying these technologies to healthcare data extraction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Keno Bressem, Robert Siepmann, Dyke Ferber, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

Large language models (LLMs) have advanced the field of artificial intelligence (AI) in medicine. However, LLMs often generate outdated or inaccurate information based on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG) as an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. RadioRAG is evaluated using a dedicated radiologic question-and-answer dataset (RadioQA). We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions, for which the correct gold-standard answers were available, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG. RadioRAG retrieved context-specific information from www.radiopaedia.org in real-time and incorporated it into its reply. RadioRAG consistently improved diagnostic accuracy across all LLMs, with relative improvements ranging from 2% to 54%. It matched or exceeded question answering without RAG across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, the degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in its effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.

7/23/2024

KARGEN: Knowledge-enhanced Automated Radiology Report Generation Using Large Language Models

Yingshu Li, Zhanyu Wang, Yunyi Liu, Lei Wang, Lingqiao Liu, Luping Zhou

Harnessing the robust capabilities of Large Language Models (LLMs) for narrative generation, logical reasoning, and common-sense knowledge integration, this study delves into utilizing LLMs to enhance automated radiology report generation (R2Gen). Despite the wealth of knowledge within LLMs, efficiently triggering relevant knowledge within these large models for specific tasks like R2Gen poses a critical research challenge. This paper presents KARGEN, a Knowledge-enhanced Automated radiology Report GENeration framework based on LLMs. Utilizing a frozen LLM to generate reports, the framework integrates a knowledge graph to unlock chest disease-related knowledge within the LLM to enhance the clinical utility of generated reports. This is achieved by leveraging the knowledge graph to distill disease-related features in a designed way. Since a radiology report encompasses both normal and disease-related findings, the extracted graph-enhanced disease-related features are integrated with regional image features, attending to both aspects. We explore two fusion methods to automatically prioritize and select the most relevant features. The fused features are employed by LLM to generate reports that are more sensitive to diseases and of improved quality. Our approach demonstrates promising results on the MIMIC-CXR and IU-Xray datasets.

9/10/2024

Harnessing Knowledge Retrieval with Large Language Models for Clinical Report Error Correction

Jinge Wu, Zhaolong Wu, Ruizhe Li, Abul Hasan, Yunsoo Kim, Jason P. Y. Cheung, Teng Zhang, Honghan Wu

This study proposes an approach for error correction in radiology reports, leveraging large language models (LLMs) and retrieval-augmented generation (RAG) techniques. The proposed framework employs a novel internal+external retrieval mechanism to extract relevant medical entities and relations from the report of interest and an external knowledge source. A three-stage inference process is introduced, decomposing the task into error detection, localization, and correction subtasks, which enhances the explainability and performance of the system. The effectiveness of the approach is evaluated using a benchmark dataset created by corrupting real-world radiology reports with realistic errors, guided by domain experts. Experimental results demonstrate the benefits of the proposed methods, with the combination of internal and external retrieval significantly improving the accuracy of error detection, localization, and correction across various state-of-the-art LLMs. The findings contribute to the development of more robust and reliable error correction systems for clinical documentation.

9/19/2024