SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH)

Read original: arXiv:2407.17126 - Published 7/25/2024 by Bernardo Consoli, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng and 3 others

Overview

  • Extracting social determinants of health (SDoH) from medical notes is typically labor-intensive and task-specific, limiting reusability and sharing.
  • This study introduced SDoH-GPT, a simple and effective few-shot Large Language Model (LLM) method that can extract SDoH without relying on extensive medical annotations or costly human intervention.
  • SDoH-GPT achieved tenfold and twentyfold reductions in time and cost respectively, and superior consistency with human annotators.
  • The combination of SDoH-GPT and XGBoost ensures high accuracy and computational efficiency.

Plain English Explanation

The paper discusses a new method called SDoH-GPT that can automatically extract important social and environmental factors, known as social determinants of health (SDoH), from medical notes. This is a critical task for understanding patients' overall health and well-being, but traditionally, it has required a lot of manual effort and specialized knowledge to do.

The researchers developed SDoH-GPT, a method that prompts a large language model with just a few labeled examples so that it can identify SDoH in medical notes. This is much faster and cheaper than the traditional approach, which relies on large sets of medical notes that have been carefully annotated by human experts.

The researchers found that SDoH-GPT cut annotation time tenfold and cost twentyfold relative to the traditional method, while agreeing closely with human annotators. This is a significant improvement that could make it much easier for healthcare providers to incorporate SDoH into their analysis and decision-making.

The paper also describes how SDoH-GPT can be combined with another machine learning model called XGBoost to further improve the accuracy and efficiency of the SDoH extraction process. This combined approach was tested on three distinct datasets and shown to be robust and reliable.

Overall, this research demonstrates the potential for large language models, like the one used in SDoH-GPT, to revolutionize medical note classification and analysis, making it faster, cheaper, and more accessible to healthcare providers.

Technical Explanation

The key innovation in this paper is the introduction of SDoH-GPT, a large language model (LLM)-based method for extracting social determinants of health (SDoH) from unstructured medical notes.

Traditionally, extracting SDoH from medical notes has relied on labor-intensive manual annotations, which are typically task-specific and limit the reusability and sharing of the extracted information. To address this, the researchers leveraged the few-shot learning capabilities of LLMs: the prompt supplies contrastive examples and concise instructions, so the model identifies SDoH in context, without extensive medical annotations or costly human intervention.
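A few-shot prompt with contrastive examples can be sketched as follows. This is a minimal illustration, not the authors' actual prompt: the `Example` class, the `build_prompt` helper, the instruction text, the unemployment category, and the sample notes are all hypothetical and chosen only to show the shape of the technique (a short instruction, a pair of similar-looking notes with opposite labels, then the note to classify).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One labeled demonstration note (hypothetical, not from the paper)."""
    note: str
    label: str  # "yes" or "no"


def build_prompt(instruction: str, examples: list[Example], note: str) -> str:
    """Assemble a few-shot prompt: instruction, demonstrations, then the query."""
    parts = [instruction.strip(), ""]
    for ex in examples:
        parts.append(f"Note: {ex.note}")
        parts.append(f"Answer: {ex.label}")
        parts.append("")
    # The final block is left open so the model completes the answer.
    parts.append(f"Note: {note}")
    parts.append("Answer:")
    return "\n".join(parts)


# Contrastive pair: both notes mention employment, but the labels differ.
EXAMPLES = [
    Example("Patient lost his job last month and is looking for work.", "yes"),
    Example("Patient works full time as a teacher.", "no"),
]

# Hypothetical instruction text; the paper's wording is not given in this summary.
INSTRUCTION = "Does the note indicate that the patient is unemployed? Answer yes or no."

prompt = build_prompt(INSTRUCTION, EXAMPLES, "She was laid off in March.")
print(prompt)
```

The resulting string would be sent to whichever LLM is in use; no model call is shown here because the summary does not specify the API or model settings.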

The results show that SDoH-GPT achieved a tenfold and twentyfold reduction in time and cost, respectively, compared to the traditional approach. Additionally, SDoH-GPT demonstrated superior consistency with human annotators, with Cohen's kappa scores of up to 0.92.
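For readers unfamiliar with the agreement metric, Cohen's kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the chance agreement implied by each rater's label frequencies. The sketch below computes it in plain Python; the two label lists are made-up illustrative data, not results from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention (an assumption, not a standard).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: 1 = SDoH present, 0 = absent.
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
model = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
kappa = cohens_kappa(human, model)
print(round(kappa, 3))
```

In this toy case the raters agree on 9 of 10 notes (p_o = 0.9) while chance agreement is p_e = 0.54, so kappa is noticeably lower than the raw agreement rate.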

To further enhance accuracy and computational efficiency, the researchers combined SDoH-GPT with the XGBoost machine learning algorithm. The combination leverages the strengths of both approaches, keeping AUROC scores consistently at 0.90 or above while remaining computationally efficient.
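The AUROC figure can be read as the probability that a randomly chosen positive note receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition in plain Python. It illustrates the metric only: this summary does not describe how SDoH-GPT and XGBoost are wired together, so the scores here are made-up stand-ins for any classifier's predicted probabilities.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = SDoH present, 0 = absent, with classifier scores.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = auroc(labels, scores)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of notes, which is fine for illustration; library implementations typically use a sort-based method instead.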

The researchers tested the SDoH-GPT and XGBoost approach across three distinct datasets, confirming its robustness and accuracy. This study highlights the potential of leveraging LLMs to revolutionize medical note classification, demonstrating their capability to achieve highly accurate classifications with significantly reduced time and cost.

Critical Analysis

The paper presents a promising approach to extracting social determinants of health (SDoH) from medical notes, addressing the limitations of traditional labor-intensive methods. The use of a large language model (LLM) in the form of SDoH-GPT is a novel solution that has demonstrated significant improvements in time, cost, and consistency with human experts.

One potential limitation of the study is its reliance on contrastive examples and concise instructions to steer the LLM. While this approach has shown promising results, it may be challenging to generalize to more diverse or complex medical note datasets. Additionally, performance may depend on the quality and coverage of the few-shot examples chosen, and documentation styles vary across healthcare systems and regions.

Furthermore, the paper does not provide a detailed analysis of the types of SDoH that the model is able to extract or the potential biases that may be present in the extracted information. It would be valuable to understand the model's performance on specific SDoH categories, such as socioeconomic status, education, or environmental factors, to assess its robustness and identify any areas for further improvement.

Despite these potential limitations, the research presented in this paper represents an important step towards leveraging large language models for medical note classification and analysis. The combination of SDoH-GPT and XGBoost, as well as the demonstrated robustness across multiple datasets, suggests that this approach has the potential to be a valuable tool for healthcare providers in understanding and addressing the social determinants that impact patient health outcomes.

Conclusion

This research paper introduces a novel and effective method, SDoH-GPT, for extracting social determinants of health (SDoH) from unstructured medical notes. By leveraging the few-shot learning capabilities of large language models and combining them with the XGBoost algorithm, the researchers have developed a solution that can significantly reduce the time and cost associated with traditional SDoH extraction methods while maintaining high accuracy and consistency with human experts.

The potential impact of this research is significant, as it could enable healthcare providers to more easily incorporate SDoH into their analysis and decision-making processes, leading to improved patient outcomes and a better understanding of the social and environmental factors that influence health. Additionally, the robust performance of the SDoH-GPT and XGBoost approach across multiple datasets suggests that it could be widely adopted and integrated into various healthcare systems.

This study highlights the transformative potential of large language models in the medical domain, and the researchers have demonstrated a compelling example of how these powerful AI tools can be leveraged to revolutionize medical note classification and analysis, ultimately leading to more informed and personalized healthcare decisions.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH)

Bernardo Consoli, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding

Extracting social determinants of health (SDoH) from unstructured medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. In this study we introduced SDoH-GPT, a simple and effective few-shot Large Language Model (LLM) method leveraging contrastive examples and concise instructions to extract SDoH without relying on extensive medical annotations or costly human intervention. It achieved tenfold and twentyfold reductions in time and cost respectively, and superior consistency with human annotators measured by Cohen's kappa of up to 0.92. The innovative combination of SDoH-GPT and XGBoost leverages the strengths of both, ensuring high accuracy and computational efficiency while consistently maintaining 0.90+ AUROC scores. Testing across three distinct datasets has confirmed its robustness and accuracy. This study highlights the potential of leveraging LLMs to revolutionize medical note classification, demonstrating their capability to achieve highly accurate classifications with significantly reduced time and cost.

7/25/2024

Extracting Social Determinants of Health from Pediatric Patient Notes Using Large Language Models: Novel Corpus and Methods

Yujuan Fu, Giridhar Kaushik Ramachandran, Nicholas J Dobbins, Namu Park, Michael Leu, Abby R. Rosenberg, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen

Social determinants of health (SDoH) play a critical role in shaping health outcomes, particularly in pediatric populations where interventions can have long-term implications. SDoH are frequently studied in the Electronic Health Record (EHR), which provides a rich repository for diverse patient data. In this work, we present a novel annotated corpus, the Pediatric Social History Annotation Corpus (PedSHAC), and evaluate the automatic extraction of detailed SDoH representations using fine-tuned and in-context learning methods with Large Language Models (LLMs). PedSHAC comprises annotated social history sections from 1,260 clinical notes obtained from pediatric patients within the University of Washington (UW) hospital system. Employing an event-based annotation scheme, PedSHAC captures ten distinct health determinants to encompass living and economic stability, prior trauma, education access, substance use history, and mental health with an overall annotator agreement of 81.9 F1. Our proposed fine-tuning LLM-based extractors achieve high performance at 78.4 F1 for event arguments. In-context learning approaches with GPT-4 demonstrate promise for reliable SDoH extraction with limited annotated examples, with extraction performance at 82.3 F1 for event triggers.

4/5/2024

Large Language Models for Integrating Social Determinant of Health Data: A Case Study on Heart Failure 30-Day Readmission Prediction

Chase Fensore, Rodrigo M. Carrillo-Larco, Shivani A. Patel, Alanna A. Morris, Joyce C. Ho

Social determinants of health (SDOH), the myriad of circumstances in which people live, grow, and age, play an important role in health outcomes. However, existing outcome prediction models often only use proxies of SDOH as features. Recent open data initiatives present an opportunity to construct a more comprehensive view of SDOH, but manually integrating the most relevant data for individual patients becomes increasingly challenging as the volume and diversity of public SDOH data grows. Large language models (LLMs) have shown promise at automatically annotating structured data. Here, we conduct an end-to-end case study evaluating the feasibility of using LLMs to integrate SDOH data, and the utility of these SDOH features for clinical prediction. We first manually label 700+ variables from two publicly-accessible SDOH data sources to one of five semantic SDOH categories. Then, we benchmark performance of 9 open-source LLMs on this classification task. Finally, we train ML models to predict 30-day hospital readmission among 39k heart failure (HF) patients, and we compare the prediction performance of the categorized SDOH variables with standard clinical variables. Additionally, we investigate the impact of few-shot LLM prompting on LLM annotation performance, and perform a metadata ablation study on prompts to evaluate which information helps LLMs accurately annotate these variables. We find that some open-source LLMs can effectively, accurately annotate SDOH variables with zero-shot prompting without the need for fine-tuning. Crucially, when combined with standard clinical features, the LLM-annotated Neighborhood and Built Environment subset of the SDOH variables shows the best performance predicting 30-day readmission of HF patients.

7/16/2024

Leveraging Open-Source Large Language Models for encoding Social Determinants of Health using an Intelligent Router

Akul Goel, Surya Narayanan Hari, Belinda Waltman, Matt Thomson

Social Determinants of Health (SDOH) play a significant role in patient health outcomes. The Center of Disease Control (CDC) introduced a subset of ICD-10 codes called Z-codes in an attempt to officially recognize and measure SDOH in the health care system. However, these codes are rarely annotated in a patient's Electronic Health Record (EHR), and instead, in many cases, need to be inferred from clinical notes. Previous research has shown that large language models (LLMs) show promise on extracting unstructured data from EHRs. However, with thousands of models to choose from with unique architectures and training sets, it's difficult to choose one model that performs the best on coding tasks. Further, clinical notes contain trusted health information making the use of closed-source language models from commercial vendors difficult, so the identification of open source LLMs that can be run within health organizations and exhibits high performance on SDOH tasks is an urgent problem. Here, we introduce an intelligent routing system for SDOH coding that uses a language model router to direct medical record data to open source LLMs that demonstrate optimal performance on specific SDOH codes. The intelligent routing system exhibits state of the art performance of 97.4% accuracy averaged across 5 codes, including homelessness and food insecurity, on par with closed models such as GPT-4o. In order to train the routing system and validate models, we also introduce a synthetic data generation and validation paradigm to increase the scale of training data without needing privacy protected medical records. Together, we demonstrate an architecture for intelligent routing of inputs to task-optimal language models to achieve high performance across a set of medical coding sub-tasks.

5/31/2024