General-Purpose vs. Domain-Adapted Large Language Models for Extraction of Structured Data from Chest Radiology Reports

2311.17213

Published 4/10/2024 by Ali H. Dhanaliwala, Rikhiya Ghosh, Sanjeev Kumar Karn, Poikavila Ullaskrishnan, Oladimeji Farri, Dorin Comaniciu, Charles E. Kahn

cs.CL eess.IV

💬

Abstract

Radiologists produce unstructured data that can be valuable for clinical care when consumed by information systems. However, variability in style limits usage. Study compares system using domain-adapted language model (RadLing) and general-purpose LLM (GPT-4) in extracting relevant features from chest radiology reports and standardizing them to common data elements (CDEs). Three radiologists annotated a retrospective dataset of 1399 chest XR reports (900 training, 499 test) and mapped to 44 pre-selected relevant CDEs. GPT-4 system was prompted with report, feature set, value set, and dynamic few-shots to extract values and map to CDEs. Output key:value pairs were compared to reference standard at both stages and an identical match was considered TP. F1 score for extraction was 97% for RadLing-based system and 78% for GPT-4 system. F1 score for mapping was 98% for RadLing and 94% for GPT-4; difference was statistically significant (P<.001). RadLing's domain-adapted embeddings were better in feature extraction and its light-weight mapper had better f1 score in CDE assignment. RadLing system also demonstrated higher capabilities in differentiating between absent (99% vs 64%) and unspecified (99% vs 89%). RadLing system's domain-adapted embeddings helped improve performance of GPT-4 system to 92% by giving more relevant few-shot prompts. RadLing system offers operational advantages including local deployment and reduced runtime costs.

Create account to get full access

Overview

Radiologists often produce unstructured data in their medical reports that could be valuable for clinical care, but the variability in their writing style makes it difficult for information systems to utilize this data.
This study compares the performance of two language models in extracting relevant features from chest radiology reports and mapping them to a set of common data elements (CDEs):
- A domain-adapted language model called RadLing
- A general-purpose language model, GPT-4

Plain English Explanation

Radiologists write reports about medical scans like chest X-rays. These reports contain useful information, but the language they use can vary a lot. This makes it hard for computer systems to automatically understand and use the data in the reports.

The researchers in this study tested two different AI models to see which one could do a better job of extracting key details from the radiology reports and converting them into a standardized format. One model, called RadLing, was trained specifically on medical language. The other, GPT-4, is a more general-purpose AI model.

The researchers had three radiologists annotate a dataset of 1,399 chest X-ray reports, mapping the information in the reports to 44 pre-defined categories (the common data elements or CDEs).

They then tested the two AI models to see how well they could:

Extract the relevant information from the reports
Map that information to the correct CDE categories

The RadLing model performed better at both tasks, with an accuracy (F1 score) of 97% for extraction and 98% for mapping. The GPT-4 model had lower accuracy, at 78% for extraction and 94% for mapping.

The researchers found that RadLing's specialized medical language understanding helped it do a better job of identifying the relevant information and assigning it to the right categories. GPT-4, being a more general model, struggled more with the specialized medical terminology and concepts.

Interestingly, the researchers also found that adding some of RadLing's specialized knowledge to GPT-4 improved its performance, boosting its accuracy to 92% for mapping the information to the CDE categories.

Technical Explanation

The researchers compared the performance of two language models in extracting relevant features from chest radiology reports and mapping them to a set of 44 pre-selected common data elements (CDEs):

RadLing: A domain-adapted language model trained specifically on medical text
GPT-4: A general-purpose large language model

A dataset of 1,399 chest X-ray reports (900 for training, 499 for testing) was annotated by three radiologists, who mapped the information in the reports to the 44 CDE categories.

The GPT-4 system was prompted with the report text, the feature set, and the value set, as well as some dynamic few-shot examples, to extract values and map them to the CDEs. The output key-value pairs were compared to the reference standard, and an exact match was considered a true positive.

The RadLing-based system achieved an F1 score of 97% for feature extraction and 98% for CDE mapping. The GPT-4 system had lower performance, with an F1 score of 78% for extraction and 94% for mapping. The difference in mapping performance was statistically significant (p<0.001).

The researchers found that RadLing's domain-adapted embeddings were more effective at both feature extraction and CDE assignment. RadLing's lightweight mapper also outperformed the GPT-4 system in CDE mapping.

Additionally, the RadLing system demonstrated higher capabilities in differentiating between absent (99% vs 64%) and unspecified (99% vs 89%) values compared to GPT-4.

Interestingly, the researchers found that incorporating RadLing's domain-adapted embeddings into the GPT-4 system improved its performance on CDE mapping to 92%, by providing more relevant few-shot prompts.

The RadLing system also offers operational advantages, such as the ability to be deployed locally and reduced runtime costs, compared to the GPT-4 system.

Critical Analysis

The study provides a thorough comparison of the performance of a domain-adapted language model (RadLing) and a general-purpose language model (GPT-4) in extracting and mapping relevant features from chest radiology reports. The researchers' careful experimental design and comprehensive evaluation metrics give confidence in the validity of their findings.

However, the study does not address several potential limitations and areas for further research:

Generalizability: The study focuses on chest radiology reports, but it's unclear how the models would perform on other types of radiology reports or medical text. Further research is needed to assess the generalizability of the findings.
Real-world deployment: The study evaluated the models' performance in a controlled setting, but real-world deployment may introduce additional challenges, such as handling incomplete or noisy data, or integrating the models into existing clinical workflows.
Clinical impact: While the study demonstrates technical superiority of the RadLing model, it does not explicitly assess the potential clinical impact or benefits of using such a system in practice. Further research is needed to understand how these advancements in language understanding could improve patient outcomes or clinical decision-making.
Ethical considerations: The study does not address potential ethical concerns, such as the risk of amplifying biases in the training data or the implications of automating tasks traditionally performed by human experts.

Despite these limitations, the study makes a valuable contribution to the field of medical language processing and highlights the importance of domain-specific language models for improving the extraction and utilization of unstructured data in healthcare.

Conclusion

This study demonstrates the advantages of using a domain-adapted language model, RadLing, for extracting and mapping relevant features from chest radiology reports, compared to a general-purpose language model like GPT-4. The RadLing system achieved significantly higher performance in both feature extraction and mapping to common data elements, highlighting the importance of specialized language understanding for processing unstructured medical data.

The findings of this research have important implications for the development of clinical decision support systems and the broader adoption of natural language processing technologies in healthcare. By improving the ability to extract and standardize valuable information from radiology reports, the RadLing system could enhance clinical workflows, support data-driven decision-making, and ultimately improve patient outcomes.

While further research is needed to assess the generalizability and real-world deployment of this approach, this study represents an important step forward in the ongoing effort to harness the power of unstructured medical data to drive innovation and progress in the healthcare industry.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

Exploring Multilingual Large Language Models for Enhanced TNM classification of Radiology Report in lung cancer staging

Hidetoshi Matsuo, Mizuho Nishio, Takaaki Matsunaga, Koji Fujimoto, Takamichi Murakami

Background: Structured radiology reports remains underdeveloped due to labor-intensive structuring and narrative-style reporting. Deep learning, particularly large language models (LLMs) like GPT-3.5, offers promise in automating the structuring of radiology reports in natural languages. However, although it has been reported that LLMs are less effective in languages other than English, their radiological performance has not been extensively studied. Purpose: This study aimed to investigate the accuracy of TNM classification based on radiology reports using GPT3.5-turbo (GPT3.5) and the utility of multilingual LLMs in both Japanese and English. Material and Methods: Utilizing GPT3.5, we developed a system to automatically generate TNM classifications from chest CT reports for lung cancer and evaluate its performance. We statistically analyzed the impact of providing full or partial TNM definitions in both languages using a Generalized Linear Mixed Model. Results: Highest accuracy was attained with full TNM definitions and radiology reports in English (M = 94%, N = 80%, T = 47%, and ALL = 36%). Providing definitions for each of the T, N, and M factors statistically improved their respective accuracies (T: odds ratio (OR) = 2.35, p < 0.001; N: OR = 1.94, p < 0.01; M: OR = 2.50, p < 0.001). Japanese reports exhibited decreased N and M accuracies (N accuracy: OR = 0.74 and M accuracy: OR = 0.21). Conclusion: This study underscores the potential of multilingual LLMs for automatic TNM classification in radiology reports. Even without additional model training, performance improvements were evident with the provided TNM definitions, indicating LLMs' relevance in radiology contexts.

6/13/2024

cs.CL cs.AI cs.LG

💬

Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models

Sowmya S. Sundaram, Benjamin Solomon, Avani Khatri, Anisha Laumas, Purvesh Khatri, Mark A. Musen

Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits for adherence to metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary from 79% to 80% (p<0.01). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement to 97% from 79% (p<0.01). These results indicate that, while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base.

4/10/2024

cs.AI cs.CL cs.IR

Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation

Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungren, Akshay Chaudhari, Serena Yeung-Levy, Curtis P. Langlotz, Sheng Wang, Hoifung Poon

The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such models in biomedicine. However, despite early promising results on some biomedical benchmarks, there are still major challenges that need to be addressed before these models can be used in real-world clinics. Frontier general-domain models such as GPT-4V still have significant performance gaps in multimodal biomedical applications. More importantly, less-acknowledged pragmatic issues, including accessibility, model cost, and tedious manual evaluation make it hard for clinicians to use state-of-the-art large models directly on private patient data. Here, we explore training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology. To maximize data efficiency, we adopt a modular approach by incorporating state-of-the-art pre-trained models for image and text modalities, and focusing on training a lightweight adapter to ground each modality to the text embedding space, as exemplified by LLaVA-Med. For training, we assemble a large dataset of over 697 thousand radiology image-text pairs. For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation. For best practice, we conduct a systematic ablation study on various choices in data engineering and multimodal training. The resulting LlaVA-Rad (7B) model attains state-of-the-art results on standard radiology tasks such as report generation and cross-modal retrieval, even outperforming much larger models such as GPT-4V and Med-PaLM M (84B). The inference of LlaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.

6/28/2024

cs.CL cs.CV

Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model

Seonhee Cho, Choonghan Kim, Jiho Lee, Chetan Chilkunda, Sujin Choi, Joo Heung Yoon

Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises questions about the feasibility of these models when encountering with the inevitable variations and errors inherent in real-world medical data. In this paper, we introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions. MID-M achieves a comparable or superior performance to task-specific fine-tuned LMMs and other general-domain ones, without the extensive domain-specific training or pre-training on multimodal data, with significantly fewer parameters. This highlights the potential of leveraging general-domain LLMs for domain-specific tasks and offers a sustainable and cost-effective alternative to traditional LMM developments. Moreover, the robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.

5/6/2024

cs.CL cs.AI eess.IV