High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates

2405.10925

Published 5/20/2024 by Janick Weberpals, Pamela A. Shaw, Kueiyu Joshua Lin, Richard Wyss, Joseph M Plasek, Li Zhou, Kerry Ngan, Thomas DeRamus, Sudha R. Raman, Bradley G. Hammill and 10 others

cs.AI cs.LG

Abstract

Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.

Overview

The paper explores methods for improving multiple imputation (MI) models by including auxiliary covariates (AC) in high-dimensional data settings.
The researchers conducted a simulation study using data on opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators, with partially observed serum creatinine levels and time-to-acute kidney injury as the outcome.
They compared different high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC, in scenarios where an important confounder (atrial fibrillation) was unobserved.

Plain English Explanation

When researchers are missing some of the data they need for a study, they can use a technique called multiple imputation to estimate the missing values. This can help improve the accuracy of their analysis. However, in studies with a large number of factors (high-dimensional data), it's not always clear how to choose the best additional information (auxiliary covariates) to include in the imputation process.

In this paper, the researchers used a simulated dataset to test different approaches for selecting auxiliary covariates, including using information from medical claims data and natural language processing of clinical notes. They were particularly interested in situations where an important factor (atrial fibrillation) was not directly observed, but could potentially be inferred from other data.

The key finding was that imputation models drawing on high-dimensional claims data produced the least biased treatment effect estimates (bias 0.072), and that combining claims data with sentence embeddings from clinical notes gave the most efficient estimates (root-mean-squared error 0.173, coverage 94%). Features derived from clinical notes alone did not outperform the simpler baseline imputation. This suggests that leveraging diverse data sources can help when missingness depends on factors that were not directly observed.

Technical Explanation

The researchers conducted a plasmode simulation study, using real-world data on opioid vs. NSAID initiators as the basis for generating 100 simulated cohorts with a null treatment effect. The outcome generation process included treatment (X), the observed serum creatinine lab results (Z2), atrial fibrillation (U, later treated as unobserved in the analysis), and 13 other investigator-derived confounders (Z1), with time-to-acute kidney injury as the outcome.
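To make the outcome-generation step concrete, the following is a minimal sketch of drawing event times from a proportional-hazards model with a null treatment effect. It is not the authors' code: the exponential baseline hazard, every coefficient value, the function name `simulate_event_time`, and the synthetic stand-in covariates are assumptions for illustration. In a real plasmode simulation the covariates would be resampled from the actual cohort rather than generated.

```python
import math
import random


def simulate_event_time(x, z2, u, z1, rng, log_hr_x=0.0):
    """Draw one exponential event time from a proportional-hazards model.

    Only the null treatment effect (log_hr_x = 0) comes from the study
    description; every other number here is a hypothetical placeholder.
    """
    baseline_hazard = 0.01            # assumed baseline event rate
    b_z2, b_u, b_z1 = 0.5, 0.7, 0.1   # assumed log hazard ratios
    linear_predictor = log_hr_x * x + b_z2 * z2 + b_u * u + b_z1 * sum(z1)
    hazard = baseline_hazard * math.exp(linear_predictor)
    # Inverse-transform sampling: T = -log(V) / hazard with V in (0, 1].
    v = 1.0 - rng.random()
    return -math.log(v) / hazard


rng = random.Random(42)
cohort = []
for _ in range(1000):
    # Synthetic stand-ins; a plasmode study keeps the real covariates.
    x = int(rng.random() < 0.5)                         # opioid vs. NSAID
    z2 = rng.gauss(0.0, 1.0)                            # standardized creatinine
    u = int(rng.random() < 0.2)                         # atrial fibrillation
    z1 = [int(rng.random() < 0.3) for _ in range(13)]   # 13 other confounders
    t = simulate_event_time(x, z2, u, z1, rng)
    cohort.append((x, z2, u, z1, t))
```

Because `log_hr_x` is zero, treatment does not enter the hazard, so any non-null estimate recovered later reflects confounding or missing-data bias rather than a true effect.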

To mimic a scenario with partially observed data, the researchers imposed missingness on 50% of the Z2 measurements, with the probability of missingness depending on Z2 itself and on U. They then created different HDMI candidate AC sets using structured claims data and NLP-derived features from clinical notes. To mimic scenarios where U was unobserved, they omitted it from all AC candidate sets.
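A missingness mechanism of this kind can be sketched as a logistic model for the missingness indicator, with the intercept tuned so that about half of the values end up missing. The logistic form, the coefficients `b_z2` and `b_u`, the helper `calibrate_intercept`, and the synthetic data are all assumptions; the source states only that missingness was a function of Z2 and U and affected 50% of measurements.

```python
import math
import random


def expit(v):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-v))


def calibrate_intercept(z2, u, b_z2, b_u, target=0.5):
    """Bisect on the intercept until the mean missingness probability hits target."""
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        mean_p = sum(expit(mid + b_z2 * z + b_u * a) for z, a in zip(z2, u)) / len(z2)
        if mean_p < target:
            lo = mid   # too little missingness: raise the intercept
        else:
            hi = mid
    return (lo + hi) / 2.0


rng = random.Random(7)
n = 5000
z2 = [rng.gauss(0.0, 1.0) for _ in range(n)]     # synthetic stand-in for creatinine
u = [int(rng.random() < 0.2) for _ in range(n)]  # synthetic stand-in for atrial fibrillation

b_z2, b_u = 0.8, 1.0                             # hypothetical dependence on Z2 and U
b0 = calibrate_intercept(z2, u, b_z2, b_u)

# M_Z2 = 1 marks a value that will be hidden from the analyst.
m_z2 = [int(rng.random() < expit(b0 + b_z2 * z + b_u * a)) for z, a in zip(z2, u)]
z2_observed = [None if m else z for z, m in zip(z2, m_z2)]
```

Since missingness here depends on the value of Z2 itself and on a variable the analyst never sees, the observed cases are not a random subset, which is the situation in which a complete case analysis is expected to be biased.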

Using LASSO, the researchers data-adaptively selected HDMI covariates associated with Z2 and with its missingness indicator (MZ2) for the MI process, and covariates associated with the unobserved U for inclusion in the propensity score models. They then estimated the treatment effect following propensity score matching in the MI datasets and benchmarked the HDMI approaches against a baseline imputation and a complete case analysis, both using only the Z1 covariates.
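The selection step can be illustrated with a small LASSO solved by cyclic coordinate descent, keeping the union of candidate covariates selected for Z2 and for MZ2. This is a sketch under stated assumptions, not the authors' pipeline: the penalty values, the synthetic candidates, and the function names are hypothetical, the binary MZ2 is fit with a linear rather than logistic LASSO for brevity, and Z2 is treated as fully observed in the demo.

```python
import random


def standardize(column):
    """Center and scale a column (assumes it is not constant)."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, returning 0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_select(columns, y, penalty, sweeps=100):
    """Return indices of columns with nonzero LASSO coefficients.

    Minimizes (1/2n) * ||y - Xb||^2 + penalty * ||b||_1 on standardized
    columns and a centered outcome, using cyclic coordinate descent.
    """
    n = len(y)
    xs = [standardize(c) for c in columns]
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]
    beta = [0.0] * len(xs)
    for _ in range(sweeps):
        for j, xj in enumerate(xs):
            # Covariance of column j with the residual, with column j's own fit added back.
            rho = sum(x * r for x, r in zip(xj, residual)) / n + beta[j]
            new_beta = soft_threshold(rho, penalty)
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * x for r, x in zip(residual, xj)]
                beta[j] = new_beta
    return {j for j, b in enumerate(beta) if b != 0.0}


rng = random.Random(0)
n, p = 400, 10
# Ten synthetic candidate auxiliary covariates (hypothetical).
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
z2 = [1.5 * columns[0][i] - 1.0 * columns[1][i] + rng.gauss(0.0, 1.0) for i in range(n)]
m_z2 = [int(2.0 * columns[2][i] + rng.gauss(0.0, 1.0) > 0) for i in range(n)]

# Union of covariates predictive of the value (Z2) or of its missingness (MZ2).
selected = lasso_select(columns, z2, penalty=0.2) | lasso_select(columns, m_z2, penalty=0.1)
```

In practice the penalty would typically be chosen by cross-validation, and the selected covariates would then be passed to the imputation model alongside the investigator-derived confounders.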

Critical Analysis

The paper provides a comprehensive evaluation of different HDMI approaches in a simulation setting with partially observed confounders. The researchers' use of real-world data as the basis for the simulation is a strength, as it increases the relevance and applicability of the findings.

However, the simulation study is limited to a single outcome (acute kidney injury) and may not generalize to other settings. Additionally, the performance of the NLP-derived features alone was not better than the baseline MI, suggesting that further research is needed to optimize the use of unstructured data sources in this context.

It would also be interesting to explore the impact of different missingness mechanisms and the robustness of the HDMI approaches to model misspecification.

Conclusion

This study suggests that HDMI approaches drawing on high-dimensional structured claims data, optionally combined with NLP-derived features such as sentence embeddings, can reduce bias and improve efficiency when confounders are partially observed and missingness depends on unobserved factors. NLP-derived features on their own did not improve on baseline MI. The findings have important implications for conducting robust causal analyses in complex healthcare datasets, where the full set of influential factors may not be directly observed.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Imputation of missing values in multi-view data

Wouter van Loon, Marjolein Fokkema, Frank de Vos, Marisa Koini, Reinhold Schmidt, Mark de Rooij

Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This may lead to very large quantities of missing data which, especially when combined with high-dimensionality, can make the application of conditional imputation methods computationally infeasible. However, the multi-view structure could be leveraged to reduce the complexity and computational load of imputation. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets and a real data application. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.

6/21/2024

stat.ML cs.LG

Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records

Wenrui Li, Xiaoyu Wang, Yuetian Sun, Snezana Milanovic, Mark Kon, Julio Enrique Castrillon-Candas

It has long been a recognized problem that many datasets contain significant levels of missing numerical data. A potentially critical predicate for application of machine learning methods to datasets involves addressing this problem. However, this is a challenging task. In this paper, we apply a recently developed multi-level stochastic optimization approach to the problem of imputation in massive medical records. The approach is based on computational applied mathematics techniques and is highly accurate. In particular, for the Best Linear Unbiased Predictor (BLUP) this multi-level formulation is exact, and is significantly faster and more numerically stable. This permits practical application of Kriging methods to data imputation problems for massive datasets. We test this approach on data from the National Inpatient Sample (NIS) data records, Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality. Numerical results show that the multi-level method significantly outperforms current approaches and is numerically robust. It has superior accuracy as compared with methods recommended in the recent report from HCUP. Benchmark tests show up to 75% reductions in error. Furthermore, the results are also superior to recent state of the art methods such as discriminative deep learning.

4/4/2024

cs.LG

Simultaneous inference for generalized linear models with unmeasured confounders

Jin-Hong Du, Larry Wasserman, Kathryn Roeder

Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish the identification conditions of various effects and non-asymptotic error bounds. We show effective Type-I error control of asymptotic $z$-tests as sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate by the Benjamini-Hochberg procedure and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting confounding effects when significant covariates are absent from the model.

4/23/2024

cs.LG stat.ML

Inherent Challenges of Post-Hoc Membership Inference for Large Language Models

Matthieu Meeus, Shubham Jain, Marek Rei, Yves-Alexandre de Montjoye

Large Language Models (LLMs) are often trained on vast amounts of undisclosed data, motivating the development of post-hoc Membership Inference Attacks (MIAs) to gain insight into their training data composition. However, in this paper, we identify inherent challenges in post-hoc MIA evaluation due to potential distribution shifts between collected member and non-member datasets. Using a simple bag-of-words classifier, we demonstrate that datasets used in recent post-hoc MIAs suffer from significant distribution shifts, in some cases achieving near-perfect distinction between members and non-members. This implies that previously reported high MIA performance may be largely attributable to these shifts rather than model memorization. We confirm that randomized, controlled setups eliminate such shifts and thus enable the development and fair evaluation of new MIAs. However, we note that such randomized setups are rarely available for the latest LLMs, making post-hoc data collection still required to infer membership for real-world LLMs. As a potential solution, we propose a Regression Discontinuity Design (RDD) approach for post-hoc data collection, which substantially mitigates distribution shifts. Evaluating various MIA methods on this RDD setup yields performance barely above random guessing, in stark contrast to previously reported results. Overall, our findings highlight the challenges in accurately measuring LLM memorization and the need for careful experimental design in (post-hoc) membership inference tasks.

6/27/2024

cs.CL cs.CR cs.LG