Extracting Social Determinants of Health from Pediatric Patient Notes Using Large Language Models: Novel Corpus and Methods

Read original: arXiv:2404.00826 - Published 4/5/2024 by Yujuan Fu, Giridhar Kaushik Ramachandran, Nicholas J Dobbins, Namu Park, Michael Leu, Abby R. Rosenberg, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen

Overview

This paper introduces a novel annotated corpus and LLM-based methods for extracting social determinants of health (SDOH) from pediatric patient notes.
The corpus, the Pediatric Social History Annotation Corpus (PedSHAC), comprises annotated social history sections from 1,260 clinical notes of pediatric patients in the University of Washington hospital system.
The authors evaluate both fine-tuned and in-context learning approaches: fine-tuned LLM-based extractors reach 78.4 F1 on event arguments, and in-context learning with GPT-4 reaches 82.3 F1 on event triggers.

Plain English Explanation

The paper focuses on developing better ways to identify important social and environmental factors that can impact a child's health and well-being from their medical records. These "social determinants of health" can include things like a family's income, education level, housing situation, access to food, and other non-medical issues that play a big role in a child's overall health.

To do this, the researchers assembled a dataset of 1,260 real pediatric medical notes and had annotators mark references to social determinants of health in the social history sections. They then used this dataset to train and test different artificial intelligence (AI) language models, systems that can read and analyze large amounts of text. The goal was to see if these AI models could accurately detect and extract information about social determinants of health from the medical notes, which could help healthcare providers better understand and address these important factors.

The researchers found that language models fine-tuned on the annotated notes extracted this information well, and that GPT-4 showed promise even when given only a limited number of annotated examples in its prompt. This suggests that AI could be a valuable tool for automatically identifying social health factors from medical records, which could lead to improved patient care and outcomes, especially for children facing social and economic challenges.

Technical Explanation

The paper describes the development of a novel corpus and methods for extracting social determinants of health (SDOH) from pediatric patient notes using large language models. The corpus, PedSHAC, contains annotated social history sections from 1,260 clinical notes of pediatric patients in the University of Washington hospital system. An event-based annotation scheme captures ten health determinants spanning living and economic stability, prior trauma, education access, substance use history, and mental health, with an overall annotator agreement of 81.9 F1.
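The paper's abstract describes the annotations as event-based, with scores reported for event triggers and event arguments. A minimal sketch of how such an annotation might be represented follows. The class names, the event type labels ("LivingArrangement", "Education"), the argument labels ("Members", "Level"), and the example sentence are all illustrative assumptions, not the PedSHAC schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Span:
    """Character offsets into the note text (end is exclusive)."""
    start: int
    end: int

    def text(self, note: str) -> str:
        return note[self.start:self.end]


@dataclass
class Event:
    """One SDoH event: a trigger span plus labeled argument spans."""
    event_type: str                       # hypothetical label, not from the paper
    trigger: Span
    arguments: Dict[str, Span] = field(default_factory=dict)


def find(note: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase and return its span."""
    start = note.index(phrase)
    return Span(start, start + len(phrase))


# Invented example text, not taken from any clinical note.
note = "Lives with mother and two siblings. Attends 3rd grade."

events = [
    Event("LivingArrangement", find(note, "Lives"),
          {"Members": find(note, "mother and two siblings")}),
    Event("Education", find(note, "Attends"),
          {"Level": find(note, "3rd grade")}),
]

for ev in events:
    args = {label: span.text(note) for label, span in ev.arguments.items()}
    print(ev.event_type, "| trigger:", ev.trigger.text(note), "| args:", args)
```

Storing offsets rather than raw strings keeps each annotation tied to a specific position in the note, which matters when the same word appears more than once.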

They then evaluated two families of LLM-based extraction methods on this corpus: models fine-tuned on the annotated notes, and in-context learning, in which GPT-4 is prompted with a limited number of annotated examples rather than being trained on the full corpus. Extraction quality is reported separately for event triggers and event arguments.
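The abstract also reports in-context learning with GPT-4 using limited annotated examples. The sketch below shows one way a few-shot prompt could be assembled from annotated examples. The instruction wording, the output format, and the example texts are assumptions made for illustration; the authors' actual prompts are not given in this summary, and no model API is called here.

```python
from typing import List, Tuple

# Hypothetical instruction and output format, not the authors' prompt.
INSTRUCTION = (
    "Extract social determinants of health events from the social history text. "
    "Return one line per event as: "
    "<event type> | <trigger> | <argument label>=<argument text>"
)


def build_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Join the instruction, the annotated examples, and the unlabeled query."""
    parts = [INSTRUCTION]
    for text, annotation in examples:
        parts.append(f"Text: {text}\nEvents:\n{annotation}")
    # The query ends with an open "Events:" slot for the model to complete.
    parts.append(f"Text: {query}\nEvents:")
    return "\n\n".join(parts)


# Invented examples, not taken from any clinical note.
examples = [
    ("Lives with mother and two siblings.",
     "LivingArrangement | Lives | Members=mother and two siblings"),
]

prompt = build_prompt(examples, "Attends 3rd grade at a public school.")
print(prompt)
```

The resulting string would be sent to the model as a single request, and the completion parsed back into events line by line.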

The fine-tuned LLM-based extractors achieved 78.4 F1 on event arguments, and in-context learning with GPT-4 achieved 82.3 F1 on event triggers, which the authors describe as promising for reliable SDOH extraction when annotated examples are limited. This suggests that this type of system could help healthcare providers automatically identify and extract social determinants of health information from pediatric medical records.
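The scores in the abstract are given as F1, the harmonic mean of precision and recall. The sketch below shows how F1 could be computed for extracted items, assuming an exact-match criterion on (label, start, end) tuples. That matching rule and the sample data are assumptions; the paper's own scoring criteria may differ.

```python
from typing import Set, Tuple

Item = Tuple[str, int, int]  # (label, start offset, end offset)


def precision_recall_f1(gold: Set[Item], pred: Set[Item]) -> Tuple[float, float, float]:
    """Score predictions against gold items using exact matching."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented gold and predicted items with hypothetical labels.
gold = {("LivingArrangement", 0, 5), ("Education", 36, 43), ("Substance", 60, 66)}
pred = {
    ("LivingArrangement", 0, 5),   # exact match
    ("Education", 36, 43),         # exact match
    ("Substance", 61, 66),         # wrong start offset, counted as an error
    ("Trauma", 70, 75),            # spurious prediction
}

precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# 2 of 4 predictions are correct and 2 of 3 gold items are found, so F1 = 4/7
```

Figures such as 78.4 F1 in the abstract correspond to this value multiplied by 100.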

Critical Analysis

The paper presents a comprehensive approach to SDOH extraction from pediatric patient notes, including the development of a novel dataset and the exploration of various language model architectures. The creation of a high-quality annotated corpus is a significant contribution, as it provides a valuable resource for future research in this area.

However, there are a few limitations. The dataset comes from a single hospital system (University of Washington), which may limit the generalizability of the findings. Additionally, the annotations were performed by a small team of experts, which could introduce potential biases or inconsistencies; the reported overall annotator agreement is 81.9 F1. Further validation of the corpus and the methods on a more diverse set of clinical notes would be beneficial.

The paper also does not delve into the potential ethical implications of using AI systems for SDOH extraction, such as concerns around data privacy, algorithmic bias, and the appropriate use of this information in clinical decision-making. These are important considerations that future research should address.

Conclusion

This paper presents a novel approach to extracting social determinants of health from pediatric patient notes using large language models. The researchers developed an annotated corpus, PedSHAC, and showed that both fine-tuned LLM-based extractors and in-context learning with GPT-4 can achieve strong performance on this task.

The ability to automatically identify and extract SDOH information from medical records has the potential to significantly improve patient care and outcomes, especially for children facing social and economic challenges. By integrating this type of AI-powered system into clinical workflows, healthcare providers could gain valuable insights into the broader social and environmental factors influencing their patients' health, enabling more holistic and effective interventions.

The paper's findings represent an important step forward in the field of clinical natural language processing and its application to social determinants of health. Further research and development in this area could lead to transformative changes in how healthcare systems address the complex interplay between social factors and individual health.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Extracting Social Determinants of Health from Pediatric Patient Notes Using Large Language Models: Novel Corpus and Methods

Yujuan Fu, Giridhar Kaushik Ramachandran, Nicholas J Dobbins, Namu Park, Michael Leu, Abby R. Rosenberg, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen

Social determinants of health (SDoH) play a critical role in shaping health outcomes, particularly in pediatric populations where interventions can have long-term implications. SDoH are frequently studied in the Electronic Health Record (EHR), which provides a rich repository for diverse patient data. In this work, we present a novel annotated corpus, the Pediatric Social History Annotation Corpus (PedSHAC), and evaluate the automatic extraction of detailed SDoH representations using fine-tuned and in-context learning methods with Large Language Models (LLMs). PedSHAC comprises annotated social history sections from 1,260 clinical notes obtained from pediatric patients within the University of Washington (UW) hospital system. Employing an event-based annotation scheme, PedSHAC captures ten distinct health determinants to encompass living and economic stability, prior trauma, education access, substance use history, and mental health with an overall annotator agreement of 81.9 F1. Our proposed fine-tuning LLM-based extractors achieve high performance at 78.4 F1 for event arguments. In-context learning approaches with GPT-4 demonstrate promise for reliable SDoH extraction with limited annotated examples, with extraction performance at 82.3 F1 for event triggers.

4/5/2024

SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH)

Bernardo Consoli, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding

Extracting social determinants of health (SDoH) from unstructured medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. In this study we introduced SDoH-GPT, a simple and effective few-shot Large Language Model (LLM) method leveraging contrastive examples and concise instructions to extract SDoH without relying on extensive medical annotations or costly human intervention. It achieved tenfold and twentyfold reductions in time and cost respectively, and superior consistency with human annotators measured by Cohen's kappa of up to 0.92. The innovative combination of SDoH-GPT and XGBoost leverages the strengths of both, ensuring high accuracy and computational efficiency while consistently maintaining 0.90+ AUROC scores. Testing across three distinct datasets has confirmed its robustness and accuracy. This study highlights the potential of leveraging LLMs to revolutionize medical note classification, demonstrating their capability to achieve highly accurate classifications with significantly reduced time and cost.

7/25/2024

Large Language Models for Integrating Social Determinant of Health Data: A Case Study on Heart Failure 30-Day Readmission Prediction

Chase Fensore, Rodrigo M. Carrillo-Larco, Shivani A. Patel, Alanna A. Morris, Joyce C. Ho

Social determinants of health (SDOH), the myriad of circumstances in which people live, grow, and age, play an important role in health outcomes. However, existing outcome prediction models often only use proxies of SDOH as features. Recent open data initiatives present an opportunity to construct a more comprehensive view of SDOH, but manually integrating the most relevant data for individual patients becomes increasingly challenging as the volume and diversity of public SDOH data grows. Large language models (LLMs) have shown promise at automatically annotating structured data. Here, we conduct an end-to-end case study evaluating the feasibility of using LLMs to integrate SDOH data, and the utility of these SDOH features for clinical prediction. We first manually label 700+ variables from two publicly-accessible SDOH data sources to one of five semantic SDOH categories. Then, we benchmark performance of 9 open-source LLMs on this classification task. Finally, we train ML models to predict 30-day hospital readmission among 39k heart failure (HF) patients, and we compare the prediction performance of the categorized SDOH variables with standard clinical variables. Additionally, we investigate the impact of few-shot LLM prompting on LLM annotation performance, and perform a metadata ablation study on prompts to evaluate which information helps LLMs accurately annotate these variables. We find that some open-source LLMs can effectively, accurately annotate SDOH variables with zero-shot prompting without the need for fine-tuning. Crucially, when combined with standard clinical features, the LLM-annotated Neighborhood and Built Environment subset of the SDOH variables shows the best performance predicting 30-day readmission of HF patients.

7/16/2024

Leveraging Open-Source Large Language Models for encoding Social Determinants of Health using an Intelligent Router

Akul Goel, Surya Narayanan Hari, Belinda Waltman, Matt Thomson

Social Determinants of Health (SDOH) play a significant role in patient health outcomes. The Center of Disease Control (CDC) introduced a subset of ICD-10 codes called Z-codes in an attempt to officially recognize and measure SDOH in the health care system. However, these codes are rarely annotated in a patient's Electronic Health Record (EHR), and instead, in many cases, need to be inferred from clinical notes. Previous research has shown that large language models (LLMs) show promise on extracting unstructured data from EHRs. However, with thousands of models to choose from with unique architectures and training sets, it's difficult to choose one model that performs the best on coding tasks. Further, clinical notes contain trusted health information making the use of closed-source language models from commercial vendors difficult, so the identification of open source LLMs that can be run within health organizations and exhibits high performance on SDOH tasks is an urgent problem. Here, we introduce an intelligent routing system for SDOH coding that uses a language model router to direct medical record data to open source LLMs that demonstrate optimal performance on specific SDOH codes. The intelligent routing system exhibits state of the art performance of 97.4% accuracy averaged across 5 codes, including homelessness and food insecurity, on par with closed models such as GPT-4o. In order to train the routing system and validate models, we also introduce a synthetic data generation and validation paradigm to increase the scale of training data without needing privacy protected medical records. Together, we demonstrate an architecture for intelligent routing of inputs to task-optimal language models to achieve high performance across a set of medical coding sub-tasks.

5/31/2024