Robust Design and Evaluation of Predictive Algorithms under Unobserved Confounding






Published 5/21/2024 by Ashesh Rambachan, Amanda Coston, Edward Kennedy



Predictive algorithms inform consequential decisions in settings where the outcome is selectively observed given choices made by human decision makers. We propose a unified framework for the robust design and evaluation of predictive algorithms in selectively observed data. We impose general assumptions on how much the outcome may vary on average between unselected and selected units conditional on observed covariates and identified nuisance parameters, formalizing popular empirical strategies for imputing missing data such as proxy outcomes and instrumental variables. We develop debiased machine learning estimators for the bounds on a large class of predictive performance estimands, such as the conditional likelihood of the outcome, a predictive algorithm's mean square error, true/false positive rate, and many others, under these assumptions. In an administrative dataset from a large Australian financial institution, we illustrate how varying assumptions on unobserved confounding leads to meaningful changes in default risk predictions and evaluations of credit scores across sensitive groups.



Overview

  • This paper proposes a unified framework for designing and evaluating predictive algorithms in situations where the outcome is selectively observed based on choices made by human decision makers.
  • The authors introduce general assumptions about how much the outcome may vary on average between unselected and selected units, given observed covariates and identified nuisance parameters.
  • They develop debiased machine learning estimators for bounds on a large class of predictive performance measures, such as the conditional likelihood of the outcome, a predictive algorithm's mean squared error, and its true/false positive rates, under these assumptions.
  • The paper illustrates how varying assumptions about unobserved confounding can lead to meaningful changes in default risk predictions and evaluations of credit scores across sensitive groups, using administrative data from a large Australian financial institution.

Plain English Explanation

Predictive algorithms are used to make important decisions in many real-world settings, such as lending or insurance. However, the data used to train and evaluate these algorithms may be biased because the outcomes are only observed for the people who were selected for the relevant action (e.g., who received a loan).

This paper proposes a way to account for this "selective observation" when designing and evaluating predictive algorithms. The authors introduce general assumptions about how much the outcomes might differ on average between the selected and unselected groups, given the observed information. Under these assumptions, they develop statistical methods that bound how well a predictive algorithm performs, using metrics such as mean squared error and true/false positive rates, including how those metrics compare across different groups.
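
To make the idea concrete, the following Python sketch simulates selective observation and computes simple bounds on the overall default rate. Everything in it is an assumption made for illustration rather than something taken from the paper: the data-generating process, the function names `simulate` and `default_rate_bounds`, and the sensitivity parameters `delta_lo` and `delta_hi`, which cap how far the default rate of rejected applicants may sit from that of approved applicants.

```python
import random


def simulate(n=20000, seed=0):
    """Hypothetical lending data with an unobserved risk factor u.

    Riskier applicants are less likely to be approved and more likely to
    default, so defaults among approved applicants understate overall risk.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random()                         # unobserved confounder
        approved = rng.random() < 1.0 - u        # selection depends on u
        default = rng.random() < 0.1 + 0.4 * u   # outcome depends on u
        rows.append((approved, default))
    return rows


def clip(value):
    """Keep a rate inside [0, 1]."""
    return min(1.0, max(0.0, value))


def default_rate_bounds(rows, delta_lo, delta_hi):
    """Bound the overall default rate using only approved applicants' outcomes.

    Assumes the unselected group's default rate lies between
    (selected rate + delta_lo) and (selected rate + delta_hi).
    """
    selected = [y for d, y in rows if d]
    share_selected = len(selected) / len(rows)
    share_unselected = 1.0 - share_selected
    mean_selected = sum(selected) / len(selected)
    lower = (share_selected * mean_selected
             + share_unselected * clip(mean_selected + delta_lo))
    upper = (share_selected * mean_selected
             + share_unselected * clip(mean_selected + delta_hi))
    return mean_selected, lower, upper


rows = simulate()
# The true rate is only computable here because the data are simulated.
true_rate = sum(y for _, y in rows) / len(rows)
naive, lower, upper = default_rate_bounds(rows, delta_lo=0.0, delta_hi=0.2)
print(f"naive={naive:.3f} bounds=[{lower:.3f}, {upper:.3f}] truth={true_rate:.3f}")
```

In this simulated setting the default rate among approved applicants understates the overall rate, while the interval, which assumes rejected applicants default at a rate between 0 and 20 percentage points higher, contains it. Whether such a range is plausible in real data is a judgment the analyst has to defend.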

The paper demonstrates how these methods can lead to different conclusions about the performance and fairness of credit risk prediction models, compared to standard approaches that don't account for selective observation. By being more careful about potential biases in the data, the researchers can get a better understanding of how well these algorithms are really working in practice.

Technical Explanation

The key innovation in this paper is a unified framework for designing and evaluating predictive algorithms in situations with selectively observed data. The authors introduce general assumptions about how much the outcome may vary on average between unselected and selected units, conditional on observed covariates and identified nuisance parameters.
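
One way to write such an assumption down is sketched below. The notation is chosen here for illustration and is not necessarily the paper's own: Y is the outcome, D indicates selection, X collects the observed covariates, and the analyst supplies the functions \underline{\delta} and \overline{\delta} that limit the gap between unselected and selected units.

```latex
\begin{align*}
\mu_d(x) &= \mathbb{E}[Y \mid D = d, X = x], \qquad
\pi(x) = \mathbb{P}(D = 1 \mid X = x), \\
\underline{\delta}(x) &\le \mu_0(x) - \mu_1(x) \le \overline{\delta}(x), \\
\mathbb{E}[Y \mid X = x]
  &= \pi(x)\,\mu_1(x) + \bigl(1 - \pi(x)\bigr)\,\mu_0(x)
   = \mu_1(x) + \bigl(1 - \pi(x)\bigr)\bigl(\mu_0(x) - \mu_1(x)\bigr), \\
\mu_1(x) + \bigl(1 - \pi(x)\bigr)\,\underline{\delta}(x)
  &\le \mathbb{E}[Y \mid X = x]
  \le \mu_1(x) + \bigl(1 - \pi(x)\bigr)\,\overline{\delta}(x).
\end{align*}
```

Only \mu_1 and \pi can be learned from selectively observed data, so the final line bounds the conditional mean outcome using identified quantities plus the analyst's chosen limits. Setting both limits to zero recovers the case of no unobserved confounding.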

These assumptions formalize popular empirical strategies for imputing missing outcomes, such as using proxy outcomes or instrumental variables. The authors then develop debiased machine learning estimators for bounds on a wide range of predictive performance measures, including the conditional likelihood of the outcome, a predictive algorithm's mean squared error, and its true/false positive rates.
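
The general flavor of a debiased estimator can be sketched for the simplest target, bounds on the mean outcome. The Python code below is a simplified illustration and not the paper's estimator: it assumes a single discrete covariate, constant limits `delta_lo` and `delta_hi` on the gap between unselected and selected units, cell means as the nuisance estimates, and a simulated data-generating process. All function names are hypothetical. It uses cross-fitting (nuisances are fitted on one fold and evaluated on the other) and a doubly robust score for the mean of E[Y | D = 1, X].

```python
import random
from collections import defaultdict


def simulate(n=40000, seed=1):
    """Hypothetical data: covariate x, selection d, outcome y, hidden u."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.randrange(3)   # observed covariate in {0, 1, 2}
        u = rng.random()       # unobserved confounder
        d = rng.random() < 0.9 - 0.2 * x - 0.4 * u
        y = rng.random() < 0.05 + 0.1 * x + 0.5 * u
        data.append((x, int(d), int(y)))
    return data


def fit_nuisances(train):
    """Cell-mean estimates of pi(x) = P(D=1 | x) and mu1(x) = E[Y | D=1, x].

    Assumes every covariate cell contains selected units in `train`.
    """
    count = defaultdict(int)
    selected = defaultdict(int)
    y_sum = defaultdict(int)
    for x, d, y in train:
        count[x] += 1
        selected[x] += d
        y_sum[x] += d * y      # outcomes are used only where d == 1
    pi = {x: selected[x] / count[x] for x in count}
    mu1 = {x: y_sum[x] / selected[x] for x in count}
    return pi, mu1


def cross_fit_bounds(data, delta_lo, delta_hi, n_folds=2):
    """Cross-fitted estimates of lower and upper bounds on E[Y]."""
    folds = [data[k::n_folds] for k in range(n_folds)]
    lo_total, hi_total = 0.0, 0.0
    for k, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != k for row in fold]
        pi, mu1 = fit_nuisances(train)
        for x, d, y in held_out:
            # Doubly robust score for E[mu1(X)]; the y term vanishes when d == 0.
            score = mu1[x] + d / pi[x] * (y - mu1[x])
            # Unselected units contribute the assumed gap to each bound.
            lo_total += score + delta_lo * (1 - d)
            hi_total += score + delta_hi * (1 - d)
    return lo_total / len(data), hi_total / len(data)


data = simulate()
# The true rate is only available because the data are simulated.
true_rate = sum(y for _, _, y in data) / len(data)
lower, upper = cross_fit_bounds(data, delta_lo=0.0, delta_hi=0.15)
print(f"bounds=[{lower:.3f}, {upper:.3f}] truth={true_rate:.3f}")
```

In practice the cell means would be replaced by flexible machine learning fits of the nuisance functions, and the paper's estimators cover a much broader class of targets than the mean outcome shown here.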

To illustrate their framework, the researchers analyze an administrative dataset from a large Australian financial institution. They show how varying the assumptions about unobserved confounding can lead to meaningful changes in default risk predictions and evaluations of credit scores across sensitive groups.

Critical Analysis

The authors acknowledge several limitations and areas for further research. First, the general assumptions they introduce about unobserved confounding may not hold in all real-world situations, and the resulting bounds may still be quite wide. More investigation is needed into how sensitive the results are to the specific assumptions made.

Additionally, the paper focuses on predictive performance metrics, but does not directly address other important considerations like model interpretability or deployment feasibility. Further work is needed to understand how this framework integrates with the broader challenges of responsible AI development and deployment.

Another potential issue is the reliance on debiased machine learning estimators, which can be computationally intensive and may require careful tuning. The practicality of applying these methods at scale in real-world settings remains an open question.

Overall, this paper represents an important step forward in addressing selective observation bias in predictive algorithms. By making the underlying assumptions explicit and developing principled ways to bound performance, the authors provide a useful toolbox for researchers and practitioners working to build more robust and fair predictive systems.


Conclusion

This paper presents a unified framework for designing and evaluating predictive algorithms in the presence of selectively observed data. By formalizing assumptions about unobserved confounding, the authors develop new statistical methods to calculate bounds on a variety of predictive performance metrics.

The application to credit data shows that default risk predictions and credit score evaluations change meaningfully as the assumptions about unobserved confounding vary, which highlights the importance of accounting for selective observation bias. As predictive algorithms continue to inform high-stakes decisions in areas like lending and insurance, this work is a valuable contribution toward building more transparent, robust, and fair AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Auditing Fairness under Unobserved Confounding

Yewon Byun, Dylan Sam, Michael Oberst, Zachary C. Lipton, Bryan Wilder





The presence of inequity is a fundamental problem in the outcomes of decision-making systems, especially when human lives are at stake. Yet, estimating notions of unfairness or inequity is difficult, particularly if they rely on hard-to-measure concepts such as risk. Such measurements of risk can be accurately obtained when no unobserved confounders have jointly influenced past decisions and outcomes. However, in the real world, this assumption rarely holds. In this paper, we show a surprising result that one can still give meaningful bounds on treatment rates to high-risk individuals, even when entirely eliminating or relaxing the assumption that all relevant risk factors are observed. We use the fact that in many real-world settings (e.g., the release of a new treatment) we have data from prior to any allocation to derive unbiased estimates of risk. This result is of immediate practical interest: we can audit unfair outcomes of existing decision-making systems in a principled manner. For instance, in a real-world study of Paxlovid allocation, our framework provably identifies that observed racial inequity cannot be explained by unobserved confounders of the same strength as important observed covariates.



Predictive Performance Comparison of Decision Policies Under Confounding


Luke Guerdan, Amanda Coston, Kenneth Holstein, Zhiwei Steven Wu





Predictive models are often introduced to decision-making tasks under the rationale that they improve performance over an existing decision-making policy. However, it is challenging to compare predictive performance against an existing decision-making policy that is generally under-specified and dependent on unobservable factors. These sources of uncertainty are often addressed in practice by making strong assumptions about the data-generating mechanism. In this work, we propose a method to compare the predictive performance of decision policies under a variety of modern identification approaches from the causal inference and off-policy evaluation literatures (e.g., instrumental variable, marginal sensitivity model, proximal variable). Key to our method is the insight that there are regions of uncertainty that we can safely ignore in the policy comparison. We develop a practical approach for finite-sample estimation of regret intervals under no assumptions on the parametric form of the status quo policy. We verify our framework theoretically and via synthetic data experiments. We conclude with a real-world application using our framework to support a pre-deployment evaluation of a proposed modification to a healthcare enrollment policy.




Automating the Selection of Proxy Variables of Unmeasured Confounders

Feng Xie, Zhengming Chen, Shanshan Luo, Wang Miao, Ruichu Cai, Zhi Geng





Recently, interest has grown in the use of proxy variables of unobserved confounding for inferring the causal effect in the presence of unmeasured confounders from observational data. One difficulty inhibiting the practical use is finding valid proxy variables of unobserved confounding to a target causal effect of interest. These proxy variables are typically justified by background knowledge. In this paper, we investigate the estimation of causal effects among multiple treatments and a single outcome, all of which are affected by unmeasured confounders, within a linear causal model, without prior knowledge of the validity of proxy variables. To be more specific, we first extend the existing proxy variable estimator, originally addressing a single unmeasured confounder, to accommodate scenarios where multiple unmeasured confounders exist between the treatments and the outcome. Subsequently, we present two different sets of precise identifiability conditions for selecting valid proxy variables of unmeasured confounders, based on the second-order statistics and higher-order statistics of the data, respectively. Moreover, we propose two data-driven methods for the selection of proxy variables and for the unbiased estimation of causal effects. Theoretical analysis demonstrates the correctness of our proposed algorithms. Experimental results on both synthetic and real-world data show the effectiveness of the proposed approach.




Hidden yet quantifiable: A lower bound for confounding strength using randomized trials

Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, Fanny Yang





In the era of fast-paced precision medicine, observational studies play a major role in properly evaluating new treatments in clinical practice. Yet, unobserved confounding can significantly compromise causal conclusions drawn from non-randomized data. We propose a novel strategy that leverages randomized trials to quantify unobserved confounding. First, we design a statistical test to detect unobserved confounding with strength above a given threshold. Then, we use the test to estimate an asymptotically valid lower bound on the unobserved confounding strength. We evaluate the power and validity of our statistical test on several synthetic and semi-synthetic datasets. Further, we show how our lower bound can correctly identify the absence and presence of unobserved confounding in a real-world setting.

