High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates






Published 5/20/2024 by Janick Weberpals, Pamela A. Shaw, Kueiyu Joshua Lin, Richard Wyss, Joseph M Plasek, Li Zhou, Kerry Ngan, Thomas DeRamus, Sudha R. Raman, Bradley G. Hammill and 10 others



Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets, and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings improved efficiency, yielding the lowest root-mean-squared error (0.173) and 94% coverage. NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.

Overview

  • The paper explores methods for improving multiple imputation (MI) models by including auxiliary covariates (AC) in high-dimensional data settings.
  • The researchers conducted a simulation study using data on opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators, with partially observed serum creatinine levels and time-to-acute kidney injury as the outcome.
  • They compared different high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC, in scenarios where an important confounder (atrial fibrillation) was unobserved.

Plain English Explanation

When researchers are missing some of the data they need for a study, they can use a technique called multiple imputation to estimate the missing values. This can help improve the accuracy of their analysis. However, in studies with a large number of factors (high-dimensional data), it's not always clear how to choose the best additional information (auxiliary covariates) to include in the imputation process.

In this paper, the researchers used a simulated dataset to test different approaches for selecting auxiliary covariates, including using information from medical claims data and natural language processing of clinical notes. They were particularly interested in situations where an important factor (atrial fibrillation) was not directly observed, but could potentially be inferred from other data.

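The results differed by approach. Imputation using auxiliary covariates drawn from claims data gave the lowest bias (0.072), while combining claims data with sentence embeddings gave the lowest root-mean-squared error (0.173) and 94% coverage. Natural language features on their own did not perform better than the baseline imputation. This suggests that drawing on diverse data sources can help when missingness depends on factors that were not observed, although in this simulation the text-derived features helped only in combination with claims data.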

Technical Explanation

The researchers conducted a plasmode simulation study, using real-world data on opioid vs. NSAID initiators (X) with observed serum creatinine lab results (Z2) as the basis for generating 100 simulated cohorts with a null treatment effect. The outcome was time to acute kidney injury, and the outcome generation process included X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1). To mimic an unobserved confounder, U was omitted from all candidate AC sets.
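The general idea of a plasmode simulation can be sketched in a few lines: keep the real covariate and treatment structure by resampling subjects, then simulate the outcome from a model whose treatment effect is fixed by the analyst. The snippet below is a minimal illustration, not the authors' implementation. The function name, the exponential event-time model, the baseline hazard, the confounder coefficients in `beta`, and the 365-day censoring time are all assumptions made for the example.

```python
import numpy as np

def simulate_plasmode_cohort(covariates, treatment, beta, baseline_hazard=0.01,
                             treatment_log_hr=0.0, censor_time=365.0, rng=None):
    """Resample subjects with replacement and simulate time-to-event outcomes.

    covariates: (n, p) array of confounders (stand-ins for Z1, Z2 and U).
    treatment:  (n,) binary array (stand-in for X).
    beta:       (p,) hypothetical log hazard ratios for the confounders.
    treatment_log_hr=0.0 encodes a null treatment effect.
    """
    rng = np.random.default_rng(rng)
    n = covariates.shape[0]
    # Bootstrap resampling preserves the joint distribution of covariates and treatment.
    idx = rng.integers(0, n, size=n)
    Z, x = covariates[idx], treatment[idx]
    # Hypothetical exponential hazard model; the treatment term vanishes under the null.
    log_hazard = np.log(baseline_hazard) + Z @ beta + treatment_log_hr * x
    event_time = rng.exponential(1.0 / np.exp(log_hazard))
    # Administrative censoring at censor_time.
    time = np.minimum(event_time, censor_time)
    event = (event_time <= censor_time).astype(int)
    return Z, x, time, event
```

Calling this function 100 times with different seeds would give 100 cohorts in which any estimated treatment effect other than zero reflects bias or sampling error.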

To mimic a scenario with partially observed data, the researchers imposed 50% missingness on the Z2 measurements, with the missingness depending on Z2 and the unobserved U. They then created different HDMI candidate AC using structured claims data and NLP-derived features from clinical notes.
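One way to impose missingness that depends on both Z2 and U is a logistic missingness model whose intercept is calibrated to hit the target proportion. The sketch below illustrates that idea only. The coefficients `coef_z2` and `coef_u`, the logistic form, and the function name are assumptions, since the summary does not state the model the authors used.

```python
import numpy as np

def impose_missingness(z2, u, target=0.5, coef_z2=0.5, coef_u=1.0, rng=None):
    """Set a share of z2 to NaN, with missingness probability depending on z2 and u.

    Returns the partially observed z2 and the missingness indicator (1 = missing).
    """
    rng = np.random.default_rng(rng)
    z2_std = (z2 - z2.mean()) / z2.std()
    linear = coef_z2 * z2_std + coef_u * u
    # Bisection on the intercept so the average missingness probability equals target.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if np.mean(1.0 / (1.0 + np.exp(-(mid + linear)))) < target:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0
    prob_missing = 1.0 / (1.0 + np.exp(-(intercept + linear)))
    missing = rng.random(z2.shape[0]) < prob_missing
    z2_obs = np.where(missing, np.nan, z2)
    return z2_obs, missing.astype(int)
```

Because the probability depends on the value of Z2 itself and on U, which the analyst does not see, a complete case analysis or an imputation model without good proxies for U can be biased.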

Using LASSO, the researchers data-adaptively selected HDMI covariates associated with Z2 and with its missingness indicator (MZ2) for the MI models, and covariates associated with U for inclusion in the propensity score models. They then estimated the treatment effect following propensity score matching in the MI datasets and benchmarked the HDMI approaches against a baseline imputation and a complete case analysis using only the Z1 covariates.
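The covariate selection step can be illustrated with a small LASSO fitted by coordinate descent: one model for the Z2 value (on complete cases) and one for the missingness indicator, keeping the union of covariates with non-zero coefficients. This is a simplified sketch under several assumptions. It uses a fixed penalty `alpha` rather than a cross-validated one, treats the binary missingness indicator with a linear rather than logistic LASSO, and all function names are hypothetical. In practice an established penalized regression library would be the usual choice.

```python
import numpy as np

def _standardise(X):
    # Centre and scale columns; constant columns are left at zero.
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0
    return (X - X.mean(axis=0)) / sd

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + alpha * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    resid = y.astype(float).copy()
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            resid += X[:, j] * b[j]              # remove feature j's contribution
            rho = X[:, j] @ resid / n
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]  # soft threshold
            resid -= X[:, j] * b[j]              # add the updated contribution back
    return b

def select_auxiliary_covariates(candidates, z2_obs, alpha=0.1):
    """Return column indices associated with the Z2 value or with its missingness."""
    missing = np.isnan(z2_obs)
    # Model 1: predictors of the Z2 value, fitted on complete cases only.
    y_val = z2_obs[~missing]
    y_val = (y_val - y_val.mean()) / y_val.std()
    b_value = lasso_coordinate_descent(_standardise(candidates[~missing]), y_val, alpha)
    # Model 2: predictors of the missingness indicator (linear simplification).
    y_mis = missing.astype(float)
    y_mis = (y_mis - y_mis.mean()) / y_mis.std()
    b_missing = lasso_coordinate_descent(_standardise(candidates), y_mis, alpha)
    return np.flatnonzero((b_value != 0) | (b_missing != 0))
```

The selected columns would then be passed to the imputation model as auxiliary covariates. A third selection model targeting U, for the propensity score, would follow the same pattern in settings where U is available to fit against.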

Critical Analysis

The paper provides a comprehensive evaluation of different HDMI approaches in a simulation setting with partially observed confounders. The researchers' use of real-world data as the basis for the simulation is a strength, as it increases the relevance and applicability of the findings.

However, the simulation study is limited to a single outcome (acute kidney injury) and may not generalize to other settings. Additionally, the performance of the NLP-derived features alone was not better than the baseline MI, suggesting that further research is needed to optimize the use of unstructured data sources in this context.

It would also be interesting to explore the impact of different missingness mechanisms and the robustness of the HDMI approaches to model misspecification.


Conclusion

This simulation study suggests that incorporating auxiliary covariates from diverse data sources, particularly structured claims data combined with NLP-derived features, can improve the performance of multiple imputation in high-dimensional settings with partially observed confounders. The findings are relevant to causal analyses of complex healthcare datasets, where the full set of influential factors may not be directly observed.


