Global Contrastive Training for Multimodal Electronic Health Records with Language Supervision

2404.06723

Published 4/11/2024 by Yingbo Ma, Suraj Kolla, Zhenhong Hu, Dhruv Kaliraman, Victoria Nolan, Ziyuan Guan, Yuanfang Ren, Brooke Armfield, Tezcan Ozrazgat-Baslanti, Jeremy A. Balch and 4 others

cs.LG cs.CL

Abstract

Modern electronic health records (EHRs) hold immense promise in tracking personalized patient health trajectories through sequential deep learning, owing to their extensive breadth, scale, and temporal granularity. Nonetheless, how to effectively leverage multiple modalities from EHRs poses significant challenges, given their complex characteristics such as high dimensionality, multimodality, sparsity, varied recording frequencies, and temporal irregularities. To this end, this paper introduces a novel multimodal contrastive learning framework, specifically focusing on medical time series and clinical notes. To tackle the challenge of sparsity and irregular time intervals in medical time series, the framework integrates temporal cross-attention transformers with a dynamic embedding and tokenization scheme for learning multimodal feature representations. To harness the interconnected relationships between medical time series and clinical notes, the framework employs a global contrastive loss, aligning a patient's multimodal feature representations with the corresponding discharge summaries. Since discharge summaries uniquely pertain to individual patients and represent a holistic view of the patient's hospital stay, machine learning models are led to learn discriminative multimodal features via global contrasting. Extensive experiments with a real-world EHR dataset demonstrated that our framework outperformed state-of-the-art approaches on the exemplar task of predicting the occurrence of nine postoperative complications for more than 120,000 major inpatient surgeries using multimodal data from the UF Health system split among three hospitals (UF Health Gainesville, UF Health Jacksonville, and UF Health Jacksonville-North).

Overview

This paper presents an approach for learning multimodal representations of electronic health records (EHRs) using contrastive learning with language supervision.
The method aligns a patient's multimodal representation, built from medical time series and clinical notes, with that patient's discharge summary, with the goal of improving downstream prediction tasks.
The authors evaluate the approach on a real-world EHR dataset from the UF Health system covering more than 120,000 major inpatient surgeries, and report that it outperforms state-of-the-art approaches at predicting nine postoperative complications.

Plain English Explanation

Electronic health records (EHRs) contain a wealth of information about patients, including text notes, medical images, and structured data like lab results. Prior work, such as "Temporal Cross-Attention for Dynamic Embedding and Tokenization in Multimodal" and "Voice-EHR: Introducing Multimodal Audio Data to Health", has explored using machine learning to extract insights from these diverse data sources.

This paper proposes an approach called "global contrastive training" that aims to capture the high-level relationships between the modalities in an EHR, here medical time series and clinical notes. The key idea is to train a model so that the combined representation of a patient's time series and notes matches that patient's discharge summary more closely than it matches any other patient's summary. Because a discharge summary is unique to one patient and describes the whole hospital stay, this pushes the model toward representations that distinguish patients from one another, which can then be used for downstream tasks such as predicting postoperative complications.
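
The matching idea can be illustrated with a small sketch. It assumes, following the abstract, that the comparison is between a patient's fused record embedding and an embedding of a discharge summary, and that similarity is measured with cosine similarity. The vectors, the dictionary keys, and the helper names `cosine_similarity` and `best_match` are made up for illustration; a trained model would produce the embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query, candidates):
    """Return the key of the candidate embedding most similar to the query."""
    return max(candidates, key=lambda key: cosine_similarity(query, candidates[key]))


# Hypothetical embeddings (assumption: 3 dimensions, hand-picked values).
record_embeddings = {
    "patient_a": [0.8, 0.1, 0.1],
    "patient_b": [0.1, 0.9, 0.2],
}
summary_embeddings = {
    "patient_a": [0.7, 0.2, 0.0],
    "patient_b": [0.0, 0.8, 0.3],
}

# For each patient's record embedding, find the closest discharge summary.
matches = {
    patient: best_match(embedding, summary_embeddings)
    for patient, embedding in record_embeddings.items()
}
print(matches)
```

In a well-trained model each record embedding would be closest to its own patient's summary, as it is for these hand-picked vectors.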

The authors evaluate their approach on a real-world EHR dataset from the UF Health system and report that it outperforms state-of-the-art approaches. This suggests that global contrastive training captures useful cross-modal information in EHRs, which could matter for healthcare applications that rely on these complex data sources.

Technical Explanation

The core of the authors' approach is a global contrastive training framework that learns multimodal representations of EHRs by exploiting the natural alignment between a patient's data and the discharge summary written for that patient. Temporal cross-attention transformers with a dynamic embedding and tokenization scheme encode the medical time series and clinical notes, which addresses the sparsity and irregular sampling intervals of the time series. The model is then trained to recognize which discharge summary belongs to which patient's multimodal representation.

This is achieved through a contrastive loss function that pulls together a patient's multimodal representation and the representation of that patient's discharge summary (a positive pair) while pushing apart representations that come from different patients (negative pairs). The language supervision comes from the discharge summaries: because each summary is unique to one patient and gives a holistic view of the hospital stay, it acts as a guiding signal for learning discriminative multimodal features.
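
The contrastive objective described above can be sketched as follows. This is a minimal, pure-Python illustration that assumes a symmetric InfoNCE (CLIP-style) loss over cosine similarities within a batch. The summary does not give the paper's exact formulation, so the function name `global_contrastive_loss`, the `temperature` value, and the toy embeddings are all assumptions for illustration rather than the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def global_contrastive_loss(patient_embs, summary_embs, temperature=0.1):
    """Symmetric InfoNCE loss (assumed form, not taken from the paper).

    patient_embs[i] and summary_embs[i] belong to the same patient (positive
    pair); every other combination in the batch is treated as a negative pair.
    """
    p = [l2_normalize(v) for v in patient_embs]
    s = [l2_normalize(v) for v in summary_embs]
    # logits[i][j]: scaled cosine similarity of patient i and summary j.
    logits = [[sum(a * b for a, b in zip(pi, sj)) / temperature for sj in s] for pi in p]
    logits_t = [list(col) for col in zip(*logits)]  # summary-to-patient direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch of three patients with hypothetical 2-dimensional embeddings.
patients = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # summaries assigned to the wrong patients

loss_aligned = global_contrastive_loss(patients, aligned)
loss_shuffled = global_contrastive_loss(patients, shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is low when each patient's embedding sits closest to its own discharge summary and high when the summaries are paired with the wrong patients, which is the behavior that training would exploit.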

"Developing Healthcare Language Model Embedding Spaces" has explored using language models to improve multimodal learning in healthcare. Similarly, this paper indicates that the semantic information in discharge summaries can help the model capture the high-level relationships between modalities in EHRs.

The authors evaluate their global contrastive training approach on a real-world EHR dataset of more than 120,000 major inpatient surgeries from three UF Health hospitals. They report that the proposed method outperforms state-of-the-art approaches at predicting the occurrence of nine postoperative complications, which points to the benefit of learning global, cross-modal representations of EHRs.

Critical Analysis

The authors acknowledge several limitations of their work. First, the global contrastive training approach assumes that different modalities from the same patient are naturally aligned, which may not always hold in practice. "Mitigating Heterogeneity in Federated Multimodal Learning for Biomedical Vision" has discussed the challenges of dealing with heterogeneous data in multimodal learning.

Additionally, the experiments focus on a single exemplar task, predicting postoperative complications, using data from one health system, so it remains to be seen how well the learned representations generalize to other healthcare applications and institutions. Further research is needed to understand the broader applicability and limitations of the global contrastive training approach.

Another potential concern is the reliance on language supervision, which may limit the method's applicability to EHR datasets where discharge summaries are missing or of poor quality. Exploring alternative ways of incorporating different modalities, such as those in "Design as Desired: Utilizing Visual Question Answering", could be a fruitful direction for future work.

Conclusion

This paper presents a global contrastive training approach for learning multimodal representations of electronic health records (EHRs) with language supervision. The key idea is to capture the high-level relationships between medical time series and clinical notes by training the model to align a patient's multimodal representation with that patient's discharge summary.

The authors' results suggest that this approach can outperform state-of-the-art approaches at predicting postoperative complications, which points to the benefit of learning global, cross-modal representations of complex healthcare data. While the method has some limitations, it is a step toward making better use of the multimodal information contained in EHRs to improve healthcare applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Next Visit Diagnosis Prediction via Medical Code-Centric Multimodal Contrastive EHR Modelling with Hierarchical Regularisation

Heejoon Koo

Predicting next visit diagnosis using Electronic Health Records (EHR) is an essential task in healthcare, critical for devising proactive future plans for both healthcare providers and patients. Nonetheless, many preceding studies have not sufficiently addressed the heterogeneous and hierarchical characteristics inherent in EHR data, inevitably leading to sub-optimal performance. To this end, we propose NECHO, a novel medical code-centric multimodal contrastive EHR learning framework with hierarchical regularisation. First, we integrate multifaceted information encompassing medical codes, demographics, and clinical notes using a tailored network design and a pair of bimodal contrastive losses, all of which pivot around a medical codes representation. We also regularise modality-specific encoders using a parental level information in medical ontology to learn hierarchical structure of EHR data. A series of experiments on MIMIC-III data demonstrates effectiveness of our approach.

5/2/2024

cs.LG cs.AI cs.IR

FlexCare: Leveraging Cross-Task Synergy for Flexible Multimodal Healthcare Prediction

Muhao Xu, Zhenfeng Zhu, Youru Li, Shuai Zheng, Yawei Zhao, Kunlun He, Yao Zhao

Multimodal electronic health record (EHR) data can offer a holistic assessment of a patient's health status, supporting various predictive healthcare tasks. Recently, several studies have embraced the multitask learning approach in the healthcare domain, exploiting the inherent correlations among clinical tasks to predict multiple outcomes simultaneously. However, existing methods necessitate samples to possess complete labels for all tasks, which places heavy demands on the data and restricts the flexibility of the model. Meanwhile, within a multitask framework with multimodal inputs, how to comprehensively consider the information disparity among modalities and among tasks still remains a challenging problem. To tackle these issues, a unified healthcare prediction model, named FlexCare, is proposed to flexibly accommodate incomplete multimodal inputs, promoting the adaption to multiple healthcare tasks. The proposed model breaks the conventional paradigm of parallel multitask prediction by decomposing it into a series of asynchronous single-task predictions. Specifically, a task-agnostic multimodal information extraction module is presented to capture decorrelated representations of diverse intra- and inter-modality patterns. Taking full account of the information disparities between different modalities and different tasks, we present a task-guided hierarchical multimodal fusion module that integrates the refined modality-level representations into an individual patient-level representation. Experimental results on multiple tasks from MIMIC-IV/MIMIC-CXR/MIMIC-NOTE datasets demonstrate the effectiveness of the proposed method. Additionally, further analysis underscores the feasibility and potential of employing such a multitask strategy in the healthcare domain. The source code is available at https://github.com/mhxu1998/FlexCare.

6/19/2024

cs.LG cs.AI

Temporal Cross-Attention for Dynamic Embedding and Tokenization of Multimodal Electronic Health Records

Yingbo Ma, Suraj Kolla, Dhruv Kaliraman, Victoria Nolan, Zhenhong Hu, Ziyuan Guan, Yuanfang Ren, Brooke Armfield, Tezcan Ozrazgat-Baslanti, Tyler J. Loftus, Parisa Rashidi, Azra Bihorac, Benjamin Shickel

The breadth, scale, and temporal granularity of modern electronic health records (EHR) systems offer great potential for estimating personalized and contextual patient health trajectories using sequential deep learning. However, learning useful representations of EHR data is challenging due to its high dimensionality, sparsity, multimodality, irregular and variable-specific recording frequency, and timestamp duplication when multiple measurements are recorded simultaneously. Although recent efforts to fuse structured EHR and unstructured clinical notes suggest the potential for more accurate prediction of clinical outcomes, less focus has been placed on EHR embedding approaches that directly address temporal EHR challenges by learning time-aware representations from multimodal patient time series. In this paper, we introduce a dynamic embedding and tokenization framework for precise representation of multimodal clinical time series that combines novel methods for encoding time and sequential position with temporal cross-attention. Our embedding and tokenization framework, when integrated into a multitask transformer classifier with sliding window attention, outperformed baseline approaches on the exemplar task of predicting the occurrence of nine postoperative complications of more than 120,000 major inpatient surgeries using multimodal data from three hospitals and two academic health centers in the United States.

4/3/2024

cs.LG

EMERGE: Integrating RAG for Improved Multimodal EHR Predictive Modeling

Yinghao Zhu, Changyu Ren, Zixiang Wang, Xiaochen Zheng, Shiyun Xie, Junlan Feng, Xi Zhu, Zhoujun Li, Liantao Ma, Chengwei Pan

The integration of multimodal Electronic Health Records (EHR) data has notably advanced clinical predictive capabilities. However, current models that utilize clinical notes and multivariate time-series EHR data often lack the necessary medical context for precise clinical tasks. Previous methods using knowledge graphs (KGs) primarily focus on structured knowledge extraction. To address this, we propose EMERGE, a Retrieval-Augmented Generation (RAG) driven framework aimed at enhancing multimodal EHR predictive modeling. Our approach extracts entities from both time-series data and clinical notes by prompting Large Language Models (LLMs) and aligns them with professional PrimeKG to ensure consistency. Beyond triplet relationships, we include entities' definitions and descriptions to provide richer semantics. The extracted knowledge is then used to generate task-relevant summaries of patients' health statuses. These summaries are fused with other modalities utilizing an adaptive multimodal fusion network with cross-attention. Extensive experiments on the MIMIC-III and MIMIC-IV datasets for in-hospital mortality and 30-day readmission tasks demonstrate the superior performance of the EMERGE framework compared to baseline models. Comprehensive ablation studies and analyses underscore the efficacy of each designed module and the framework's robustness to data sparsity. EMERGE significantly enhances the use of multimodal EHR data in healthcare, bridging the gap with nuanced medical contexts crucial for informed clinical predictions.

6/4/2024

cs.CL cs.AI cs.LG