Calibrated and Conformal Propensity Scores for Causal Effect Estimation

arXiv:2306.00382

Published 6/6/2024 by Shachi Deshpande, Volodymyr Kuleshov

Abstract

Propensity scores are commonly used to estimate treatment effects from observational data. We argue that the probabilistic output of a learned propensity score model should be calibrated -- i.e., a predictive treatment probability of 90% should correspond to 90% of individuals being assigned the treatment group -- and we propose simple recalibration techniques to ensure this property. We prove that calibration is a necessary condition for unbiased treatment effect estimation when using popular inverse propensity weighted and doubly robust estimators. We derive error bounds on causal effect estimates that directly relate to the quality of uncertainties provided by the probabilistic propensity score model and show that calibration strictly improves this error bound while also avoiding extreme propensity weights. We demonstrate improved causal effect estimation with calibrated propensity scores in several tasks including high-dimensional image covariates and genome-wide association studies (GWASs). Calibrated propensity scores improve the speed of GWAS analysis by more than two-fold by enabling the use of simpler models that are faster to train.

Overview

Propensity scores are used to estimate treatment effects from observational data
This paper argues that propensity score models should be calibrated, so that their predicted treatment probabilities match the observed treatment assignment rates
The authors propose recalibration techniques and prove that calibration is necessary for unbiased treatment effect estimation
They derive error bounds on causal effect estimates that depend on the quality of the propensity model's uncertainties, and show that calibration strictly improves these bounds while avoiding extreme propensity weights
The paper demonstrates improved causal effect estimation with calibrated propensity scores in several tasks, including genomic studies

Plain English Explanation

When researchers want to understand the effects of a treatment or intervention using observational data (rather than a controlled experiment), they often turn to propensity scores. Propensity scores estimate the probability that an individual would receive the treatment based on their observed characteristics.

However, the authors of this paper argue that the propensity score model needs to be calibrated. This means that if the model says there is a 90% chance of receiving the treatment, then 90% of the individuals given that score should actually end up in the treatment group. Calibration matters because it ensures the propensity scores reflect the real-world treatment probabilities.
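
As a rough illustration of what checking calibration involves, the sketch below bins predicted propensities and compares the mean prediction in each bin with the observed treatment rate. The function names (calibration_table, expected_calibration_error), the ten equal-width bins, and the synthetic data are assumptions made for this example; none of them are taken from the paper.

```python
import numpy as np

def calibration_table(scores, treated, n_bins=10):
    """Per-bin (count, mean predicted propensity, observed treatment rate)."""
    scores = np.asarray(scores, dtype=float)
    treated = np.asarray(treated, dtype=float)
    # Interior edges of n_bins equal-width bins on [0, 1];
    # np.digitize then returns a bin index in 0..n_bins-1.
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_index = np.digitize(scores, interior_edges)
    rows = []
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():  # skip empty bins
            rows.append((int(in_bin.sum()),
                         scores[in_bin].mean(),
                         treated[in_bin].mean()))
    return rows

def expected_calibration_error(scores, treated, n_bins=10):
    """Count-weighted average gap between predicted and observed rates."""
    rows = calibration_table(scores, treated, n_bins)
    total = sum(count for count, _, _ in rows)
    return sum(count / total * abs(pred - obs) for count, pred, obs in rows)

# Synthetic data (assumption): treatment is drawn from a known probability,
# so the true propensity is calibrated by construction.
rng = np.random.default_rng(0)
true_propensity = rng.uniform(0.05, 0.95, size=20_000)
treated = rng.binomial(1, true_propensity)

# A deliberately overconfident model: scores pushed away from 0.5.
overconfident = np.clip(0.5 + 1.8 * (true_propensity - 0.5), 0.01, 0.99)

ece_calibrated = expected_calibration_error(true_propensity, treated)
ece_overconfident = expected_calibration_error(overconfident, treated)
print(f"ECE, calibrated scores:    {ece_calibrated:.3f}")
print(f"ECE, overconfident scores: {ece_overconfident:.3f}")
```

A small gap in every bin is what the 90% example above describes; the overconfident scores should show a clearly larger gap.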

The authors prove that calibration is necessary for unbiased estimates with two popular methods for estimating treatment effects: inverse propensity weighting and doubly robust estimation. They also derive error bounds on the causal effect estimate that depend on the quality of the propensity model's probabilities, and show that calibration tightens these bounds and helps avoid extreme propensity weights that can skew the results.
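
To make the two estimators concrete, here is a minimal sketch of their standard textbook forms on synthetic data. Everything specific in it is an assumption for illustration: the function names, the single confounder x, the true effect of 2.0, the overconfident score with slope 2.5, and the clipping at 0.001. It is not the paper's experimental setup.

```python
import numpy as np

def ipw_ate(y, t, e):
    """Inverse propensity weighted (IPW) estimate of the average treatment effect."""
    return np.mean(t * y / e - (1 - t) * y / (1 - e))

def aipw_ate(y, t, e, mu1, mu0):
    """Doubly robust (augmented IPW) estimate; mu1, mu0 are outcome predictions."""
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / e
                   - (1 - t) * (y - mu0) / (1 - e))

def max_weight(t, e):
    """Largest inverse-propensity weight actually used in the sample."""
    return np.max(t / e + (1 - t) / (1 - e))

# Synthetic data (assumption): one confounder x, true effect fixed at 2.0.
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-x))        # true P(T=1 | x), calibrated
t = rng.binomial(1, e_true)
y = 2.0 * t + x + rng.normal(size=n)

# Overconfident, miscalibrated scores: too close to 0 and 1.
e_over = np.clip(1.0 / (1.0 + np.exp(-2.5 * x)), 1e-3, 1 - 1e-3)

# Crude outcome model for the doubly robust estimator: group means only.
mu1 = np.full(n, y[t == 1].mean())
mu0 = np.full(n, y[t == 0].mean())

ipw_true = ipw_ate(y, t, e_true)
ipw_over = ipw_ate(y, t, e_over)
aipw_true = aipw_ate(y, t, e_true, mu1, mu0)

print(f"IPW,  true propensity:   {ipw_true:.2f}")
print(f"IPW,  overconfident:     {ipw_over:.2f}")
print(f"AIPW, true propensity:   {aipw_true:.2f}")
print(f"max weight (true):       {max_weight(t, e_true):.1f}")
print(f"max weight (overconf.):  {max_weight(t, e_over):.1f}")
```

In this toy setup the estimates built on the true propensity should land close to 2.0, while the overconfident scores would be expected to give a visibly biased estimate and much larger weights.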

The paper demonstrates that using calibrated propensity scores leads to better causal effect estimates in a variety of tasks, including image analysis and genome-wide association studies (GWASs). Calibration also enables the use of simpler, faster propensity score models for GWAS, speeding up the analysis by more than two-fold.

Technical Explanation

The key technical contributions of this paper are:

Propensity Score Calibration: The authors propose simple recalibration techniques to ensure the predictive probabilities of a learned propensity score model match the real-world treatment assignment rates.
Necessity of Calibration: They prove that calibration is a necessary condition for unbiased treatment effect estimation when using inverse propensity weighted and doubly robust estimators.
Error Bounds: The authors derive error bounds on causal effect estimates that directly relate to the quality of uncertainties provided by the propensity score model. They show that calibration strictly improves this error bound while also avoiding extreme propensity weights.
Empirical Evaluations: The paper demonstrates improved causal effect estimation with calibrated propensity scores in several tasks, including high-dimensional image covariates and genome-wide association studies (GWASs). Calibration enables the use of simpler, faster propensity score models for GWAS, accelerating the analysis by more than two-fold.
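
The recalibration step can be sketched as follows. This example uses histogram binning with equal-mass bins, which is one standard recalibration technique; the paper's own procedure may differ. The function names, the 20 bins, the held-out calibration split, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def ipw_ate(y, t, e):
    """Inverse propensity weighted estimate of the average treatment effect."""
    return np.mean(t * y / e - (1 - t) * y / (1 - e))

def fit_histogram_binning(scores, treated, n_bins=20):
    """Learn equal-mass bin edges and the observed treatment rate per bin."""
    interior_edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bin_index = np.digitize(scores, interior_edges)
    rates = np.array([treated[bin_index == b].mean() for b in range(n_bins)])
    # Guard (assumption): keep rates away from 0 and 1 so weights stay finite.
    return interior_edges, np.clip(rates, 1e-3, 1 - 1e-3)

def apply_histogram_binning(scores, interior_edges, rates):
    """Replace each raw score with the treatment rate of its bin."""
    return rates[np.digitize(scores, interior_edges)]

def max_weight(t, e):
    """Largest inverse-propensity weight actually used in the sample."""
    return np.max(t / e + (1 - t) / (1 - e))

# Synthetic data (assumption): one confounder x, true effect fixed at 2.0.
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 2.0 * t + x + rng.normal(size=n)

# Raw scores from a hypothetical overconfident propensity model.
raw_scores = np.clip(1.0 / (1.0 + np.exp(-2.5 * x)), 1e-3, 1 - 1e-3)

# Fit the recalibration map on one half, estimate the effect on the other.
cal = np.arange(n) < n // 2
est = ~cal
edges, rates = fit_histogram_binning(raw_scores[cal], t[cal])
recal_scores = apply_histogram_binning(raw_scores[est], edges, rates)

ipw_raw = ipw_ate(y[est], t[est], raw_scores[est])
ipw_recal = ipw_ate(y[est], t[est], recal_scores)
max_w_raw = max_weight(t[est], raw_scores[est])
max_w_recal = max_weight(t[est], recal_scores)

print(f"IPW, raw scores:          {ipw_raw:.2f}")
print(f"IPW, recalibrated scores: {ipw_recal:.2f}")
print(f"max weight, raw:          {max_w_raw:.1f}")
print(f"max weight, recalibrated: {max_w_recal:.1f}")
```

Because the recalibration map is a simple post-processing step on top of an existing model, it adds little cost, which is consistent with the paper's point that calibration lets simpler, faster propensity models be used.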

Critical Analysis

The paper provides a thorough theoretical and empirical analysis of the importance of propensity score calibration for unbiased treatment effect estimation. The authors make a compelling case that calibration is a necessary condition for popular causal inference methods to work as intended.
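
One way to see why calibration is needed for inverse propensity weighting is the short argument below. It is an illustrative sketch rather than the paper's proof. The notation is assumed for this example: Q is the propensity model's output, Y(1) is the potential outcome under treatment, and g is an arbitrary bounded function. The argument considers the special case where the treated outcome depends on the covariates only through the score, so that Y(1) = g(Q) and TY = T g(Q).

```latex
% Q = Q(X) is the propensity model's output, with Q > 0.
% Assumed special case: Y(1) = g(Q) for a bounded function g.
\begin{align*}
\mathbb{E}\!\left[\frac{T\,Y}{Q}\right]
  &= \mathbb{E}\!\left[\frac{T\,g(Q)}{Q}\right]
   = \mathbb{E}\!\left[\frac{\mathbb{E}[T \mid Q]}{Q}\, g(Q)\right]
   = \mathbb{E}\!\left[\frac{P(T = 1 \mid Q)}{Q}\, g(Q)\right], \\
\mathbb{E}[Y(1)] &= \mathbb{E}[g(Q)].
\end{align*}
% The two lines agree for every bounded g only if
% P(T = 1 \mid Q) = Q almost surely, which is the calibration condition.
```

If the model is miscalibrated, some choice of g makes the weighted average differ from the target, so the IPW estimate of the treated mean is biased for that outcome.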

One potential limitation is that the paper focuses on binary treatment scenarios. It would be interesting to see how the calibration techniques extend to more complex, multi-class treatment settings. Additionally, the authors note that their error bounds rely on certain assumptions about the underlying data-generating process, and it's unclear how sensitive the results are to violations of these assumptions.

While the empirical evaluations demonstrate the benefits of calibration across several domains, it would be valuable to see the techniques applied to even more diverse datasets and real-world applications to further validate the generalizability of the findings.

Overall, this paper makes an important contribution to the causal inference literature by highlighting the critical role of propensity score calibration and providing practical solutions to address this issue. The insights and methods presented here should be carefully considered by researchers and practitioners working on causal effect estimation from observational data.

Conclusion

This paper makes a strong case for the importance of calibrating propensity score models when estimating treatment effects from observational data. The authors show that calibration is a necessary condition for unbiased causal inference using popular methods like inverse propensity weighting and doubly robust estimation.

By deriving error bounds tied to the quality of the propensity model's uncertainties, the paper shows that calibration strictly improves the bound on the error of causal effect estimates. The empirical results across diverse tasks, including genomic studies, further support the practical benefits of using calibrated propensity scores.

These findings have significant implications for the field of causal inference, as they highlight a critical consideration that should be addressed to ensure the validity of observational studies. The techniques presented in this paper provide a principled approach for researchers to calibrate their propensity score models and obtain more robust and trustworthy estimates of treatment effects.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Conformal Prediction for Causal Effects of Continuous Treatments

Maresa Schroder, Dennis Frauen, Jonas Schweisthal, Konstantin Heß, Valentyn Melnychuk, Stefan Feuerriegel

Uncertainty quantification of causal effects is crucial for safety-critical applications such as personalized medicine. A powerful approach for this is conformal prediction, which has several practical benefits due to model-agnostic finite-sample guarantees. Yet, existing methods for conformal prediction of causal effects are limited to binary/discrete treatments and make highly restrictive assumptions such as known propensity scores. In this work, we provide a novel conformal prediction method for potential outcomes of continuous treatments. We account for the additional uncertainty introduced through propensity estimation so that our conformal prediction intervals are valid even if the propensity score is unknown. Our contributions are three-fold: (1) We derive finite-sample prediction intervals for potential outcomes of continuous treatments. (2) We provide an algorithm for calculating the derived intervals. (3) We demonstrate the effectiveness of the conformal prediction intervals in experiments on synthetic and real-world datasets. To the best of our knowledge, we are the first to propose conformal prediction for continuous treatments when the propensity score is unknown and must be estimated from data.

7/4/2024

cs.LG cs.AI

Orthogonal Causal Calibration

Justin Whitehouse, Christopher Jung, Vasilis Syrgkanis, Bryan Wilder, Zhiwei Steven Wu

Estimates of causal parameters such as conditional average treatment effects and conditional quantile treatment effects play an important role in real-world decision making. Given this importance, one should ensure these estimators are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters. In this work, we provide a general framework for calibrating predictors involving nuisance estimation. We consider a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\ell$, under which we say an estimator $\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. We prove generic upper bounds on the calibration error of any causal parameter estimate $\theta$ with respect to any loss $\ell$ using a concept called Neyman Orthogonality. Our bounds involve two decoupled terms - one measuring the error in estimating the unknown nuisance parameters, and the other representing the calibration error in a hypothetical world where the learned nuisance estimates were true. We use our bound to analyze the convergence of two sample splitting algorithms for causal calibration. One algorithm, which applies to universally orthogonalizable loss functions, transforms the data into generalized pseudo-outcomes and applies an off-the-shelf calibration procedure. The other algorithm, which applies to conditionally orthogonalizable loss functions, extends the classical uniform mass binning algorithm to include nuisance estimation. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation.

6/5/2024

stat.ML cs.LG

Inverse Probability of Treatment Weighting with Deep Sequence Models Enables Accurate treatment effect Estimation from Electronic Health Records

Junghwan Lee, Simin Ma, Nicoleta Serban, Shihao Yang

Observational data have been actively used to estimate treatment effect, driven by the growing availability of electronic health records (EHRs). However, EHRs typically consist of longitudinal records, often introducing time-dependent confoundings that hinder the unbiased estimation of treatment effect. Inverse probability of treatment weighting (IPTW) is a widely used propensity score method since it provides unbiased treatment effect estimation and its derivation is straightforward. In this study, we aim to utilize IPTW to estimate treatment effect in the presence of time-dependent confounding using claims records. Previous studies have utilized propensity score methods with features derived from claims records through feature processing, which generally requires domain knowledge and additional resources to extract information to accurately estimate propensity scores. Deep sequence models, particularly recurrent neural networks and self-attention-based architectures, have demonstrated good performance in modeling EHRs for various downstream tasks. We propose that these deep sequence models can provide accurate IPTW estimation of treatment effect by directly estimating the propensity scores from claims records without the need for feature processing. We empirically demonstrate this by conducting comprehensive evaluations using synthetic and semi-synthetic datasets.

6/14/2024

cs.LG

Debiased Collaborative Filtering with Kernel-Based Causal Balancing

Haoxuan Li, Chunyuan Zheng, Yanghao Xiao, Peng Wu, Zhi Geng, Xu Chen, Peng Cui

Debiased collaborative filtering aims to learn an unbiased prediction model by removing different biases in observational datasets. To solve this problem, one of the simple and effective methods is based on the propensity score, which adjusts the observational sample distribution to the target one by reweighting observed instances. Ideally, propensity scores should be learned with causal balancing constraints. However, existing methods usually ignore such constraints or implement them with unreasonable approximations, which may affect the accuracy of the learned propensity scores. To bridge this gap, in this paper, we first analyze the gaps between the causal balancing requirements and existing methods such as learning the propensity with cross-entropy loss or manually selecting functions to balance. Inspired by these gaps, we propose to approximate the balancing functions in reproducing kernel Hilbert space and demonstrate that, based on the universal property and representer theorem of kernel functions, the causal balancing constraints can be better satisfied. Meanwhile, we propose an algorithm that adaptively balances the kernel function and theoretically analyze the generalization error bound of our methods. We conduct extensive experiments to demonstrate the effectiveness of our methods, and to promote this research direction, we have released our project at https://github.com/haoxuanli-pku/ICLR24-Kernel-Balancing.

5/1/2024

cs.IR cs.LG