Leveraging Expert Consistency to Improve Algorithmic Decision Support

Read original: arXiv:2101.09648 - Published 6/4/2024 by Maria De-Arteaga, Vincent Jeanselme, Artur Dubrawski, Alexandra Chouldechova

Overview

Machine learning (ML) is increasingly being used to support high-stakes decisions, but there is often a "construct gap" between the decision criteria and what is captured in the data used to train the ML models.
This can lead to ML models failing to capture important dimensions of the decision-making process, limiting their utility for decision support.
The authors explore using historical expert decisions as a source of information to help narrow the construct gap, in cases where experts exhibit consistency with each other.
The paper introduces a methodology to estimate expert consistency and combine expert decisions with observed outcomes to train better predictive models.

Plain English Explanation

Machine learning algorithms are often used to help make important decisions, such as in healthcare or child welfare. However, there can be a disconnect between what decision-makers actually care about (the "construct of interest") and the labels available in the data used to train the algorithm (the "proxies").

For example, an algorithm designed to predict risk in child welfare cases might be trained on historical case data and outcomes, but the data may not fully capture all the factors that a human expert would consider when making a decision. As a result, the algorithm may miss important aspects of the decision-making process.

The researchers in this paper propose a way to address this issue by incorporating information from expert human decision-makers, in addition to the observed outcomes. The key insight is that in some cases experts make decisions that are consistent with one another, and this consistency can provide valuable information to help the algorithm better understand the true decision-making process.

The paper outlines a two-step methodology to do this. First, it uses an "influence function-based" approach to estimate the consistency of expert decisions, even when each case is only assessed by a single expert. Second, it introduces a "label amalgamation" approach that allows the algorithm to learn from both the expert decisions and the observed outcomes simultaneously.

Through simulations and real-world data, the researchers show that this approach can lead to better predictive performance than using either the expert decisions or the observed outcomes alone. This suggests that incorporating human expertise can be a valuable way to improve the performance of machine learning systems used for high-stakes decision support.

Technical Explanation

The paper explores the challenge of "construct gap" in the context of using machine learning (ML) to support high-stakes decisions. The construct gap refers to the mismatch between the decision criteria of interest and the proxies or labels used to train the ML models.

To address this, the authors propose leveraging historical expert decisions as an additional source of information, alongside observed outcomes, to train better predictive models. The key insight is that in some cases, experts may exhibit consistency in their decision-making, and this consistency can provide valuable signals to help narrow the construct gap.

The methodology has two core steps:

Estimating Expert Consistency: The authors develop an "influence function-based" approach to estimate expert consistency indirectly, even when each case is assessed by a single expert. This allows them to identify instances where experts exhibit high levels of agreement with each other.
Label Amalgamation: The authors introduce a "label amalgamation" approach that allows ML models to learn from both the expert decisions and the observed outcomes simultaneously. This enables the models to benefit from the additional information provided by the experts, while still capturing the patterns in the observed data.
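
The two steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions. Instead of influence functions, it uses leave-one-expert-out retraining as a crude stand-in: a case is flagged as "consistent" when a model of expert decisions is confident about it and that prediction barely moves when any single expert's cases are removed. The function names, the thresholds `conf` and `tol`, and the hard switch between expert decisions and observed outcomes are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_logistic(X, d, steps=500, lr=0.5, l2=1e-2):
    """Fit L2-regularised logistic regression by plain gradient descent."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * (Xb.T @ (p - d) / len(d) + l2 * w)
    return w


def predict_proba(w, X):
    """Predicted probability that an expert decides 1 for each case."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))


def estimate_consistency(X, d, expert_ids, conf=0.2, tol=0.1):
    """Flag cases whose predicted expert decision is confident and stable.

    Assumption: leave-one-expert-out retraining stands in for the
    influence-function estimator described in the text. A case is flagged
    when the model trained on all experts puts its probability within
    `conf` of 0 or 1, and dropping any single expert's cases shifts that
    probability by at most `tol`.
    """
    p_full = predict_proba(fit_logistic(X, d), X)
    max_shift = np.zeros(len(d))
    for e in np.unique(expert_ids):
        keep = expert_ids != e  # training set without expert e
        p_loo = predict_proba(fit_logistic(X[keep], d[keep]), X)
        max_shift = np.maximum(max_shift, np.abs(p_loo - p_full))
    confident = np.minimum(p_full, 1.0 - p_full) <= conf
    return confident & (max_shift <= tol)


def amalgamate_labels(d, y, consistent):
    """Use the expert decision where flagged consistent, else the outcome."""
    return np.where(consistent, d, y)
```

Here `X` is a feature matrix, `d` the historical expert decisions (0 or 1), `y` the observed outcomes, and `expert_ids` records which single expert assessed each case. The amalgamated labels returned by `amalgamate_labels` would then serve as the training target for the final predictive model. Retraining once per expert is only practical for small models and few experts, which is one reason an influence-based approximation is attractive.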

The authors evaluate their approach through simulations in a clinical setting and using real-world data from the child welfare domain. The results indicate that the proposed methodology successfully narrows the construct gap, leading to better predictive performance compared to learning from either the expert decisions or the observed outcomes alone.

This work builds on prior research on the challenges of integrating human expertise with machine learning and evaluating the performance of predictive algorithms in high-stakes decision-making contexts.

Critical Analysis

The paper presents a well-designed and thoughtful approach to addressing the construct gap issue in using machine learning for high-stakes decision support. The authors' focus on leveraging expert decision-making as a valuable source of information is a promising direction, and the methodology they develop appears to be both theoretically grounded and empirically validated.

That said, the authors acknowledge several caveats and limitations. For example, they note that their approach relies on the assumption that experts exhibit some level of consistency in their decision-making, which may not always be the case. Additionally, the real-world data used in the evaluation is limited to a single domain (child welfare), and further research would be needed to assess the generalizability of the approach to other high-stakes decision-making contexts.

Another potential concern is the extent to which the proposed methodology can truly "narrow the construct gap." While the empirical results suggest improvements in predictive performance, it's unclear whether the models are fully capturing the nuanced decision-making criteria of human experts. There may still be important dimensions that are not adequately represented in the data, either from the experts or the observed outcomes.

Future research could explore ways to further bridge the gap between the construct of interest and the available data, perhaps through more sophisticated techniques for eliciting and incorporating expert knowledge, or by investigating alternative sources of information that could complement the expert decisions and observed outcomes.

Overall, this paper represents an important contribution to the growing body of work on integrating human expertise with machine learning for high-stakes decision support. The authors have demonstrated a thoughtful and rigorous approach to a challenging problem, and their work paves the way for further advancements in this critical area of research.

Conclusion

This paper presents a novel methodology for addressing the "construct gap" that can arise when using machine learning (ML) to support high-stakes decision-making. By incorporating historical expert decisions as an additional source of information, the authors show that it is possible to improve the predictive performance of ML models, compared to using either expert decisions or observed outcomes alone.

The key innovation is the two-step approach of (1) estimating expert consistency using an influence function-based method, and (2) combining the expert decisions and observed outcomes through a label amalgamation process. This allows the ML models to learn from both the experts' decision-making criteria and the patterns in the observed data, helping to narrow the construct gap.

The empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the approach is effective. While the authors acknowledge several caveats and limitations, this work represents an important step forward in integrating human expertise with machine learning for high-stakes decision support.

As machine learning continues to be deployed in increasingly consequential domains, approaches like the one presented in this paper will be essential for ensuring that these systems capture the full complexity of human decision-making, and can be reliably used to support and augment human expertise, rather than simply replace it.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Leveraging Expert Consistency to Improve Algorithmic Decision Support

Maria De-Arteaga, Vincent Jeanselme, Artur Dubrawski, Alexandra Chouldechova

Machine learning (ML) is increasingly being used to support high-stakes decisions. However, there is frequently a construct gap: a gap between the construct of interest to the decision-making task and what is captured in proxies used as labels to train ML models. As a result, ML models may fail to capture important dimensions of decision criteria, hampering their utility for decision support. Thus, an essential step in the design of ML systems for decision support is selecting a target label among available proxies. In this work, we explore the use of historical expert decisions as a rich -- yet also imperfect -- source of information that can be combined with observed outcomes to narrow the construct gap. We argue that managers and system designers may be interested in learning from experts in instances where they exhibit consistency with each other, while learning from observed outcomes otherwise. We develop a methodology to enable this goal using information that is commonly available in organizational information systems. This involves two core steps. First, we propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert. Second, we introduce a label amalgamation approach that allows ML models to simultaneously learn from expert decisions and observed outcomes. Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap, yielding better predictive performance than learning from either observed outcomes or expert decisions alone.

6/4/2024

(De)Noise: Moderating the Inconsistency Between Human Decision-Makers

Nina Grgić-Hlača, Junaid Ali, Krishna P. Gummadi, Jennifer Wortman Vaughan

Prior research in psychology has found that people's decisions are often inconsistent. An individual's decisions vary across time, and decisions vary even more across people. Inconsistencies have been identified not only in subjective matters, like matters of taste, but also in settings one might expect to be more objective, such as sentencing, job performance evaluations, or real estate appraisals. In our study, we explore whether algorithmic decision aids can be used to moderate the degree of inconsistency in human decision-making in the context of real estate appraisal. In a large-scale human-subject experiment, we study how different forms of algorithmic assistance influence the way that people review and update their estimates of real estate prices. We find that both (i) asking respondents to review their estimates in a series of algorithmically chosen pairwise comparisons and (ii) providing respondents with traditional machine advice are effective strategies for influencing human responses. Compared to simply reviewing initial estimates one by one, the aforementioned strategies lead to (i) a higher propensity to update initial estimates, (ii) a higher accuracy of post-review estimates, and (iii) a higher degree of consistency between the post-review estimates of different respondents. While these effects are more pronounced with traditional machine advice, the approach of reviewing algorithmically chosen pairs can be implemented in a wider range of settings, since it does not require access to ground truth data.

7/17/2024

Designing Decision Support Systems Using Counterfactual Prediction Sets

Eleni Straitouri, Manuel Gomez Rodriguez

Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels. However, since their predictions are not perfect, these systems also need to make human experts understand when and how to use these predictions to update their own predictions. Unfortunately, this has been proven challenging. In this context, it has been recently argued that an alternative type of decision support systems may circumvent this challenge. Rather than providing a single label prediction, these systems provide a set of label prediction values constructed using a conformal predictor, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. However, the design and evaluation of these systems have so far relied on stylized expert models, questioning their promise. In this paper, we revisit the design of this type of systems from the perspective of online learning and develop a methodology that does not require, nor assumes, an expert model. Our methodology leverages the nested structure of the prediction sets provided by any conformal predictor and a natural counterfactual monotonicity assumption to achieve an exponential improvement in regret in comparison to vanilla bandit algorithms. We conduct a large-scale human subject study (n = 2,751) to compare our methodology to several competitive baselines. The results show that, for decision support systems based on prediction sets, limiting experts' level of agency leads to greater performance than allowing experts to always exercise their own agency. We have made available the data gathered in our human subject study as well as an open source implementation of our system at https://github.com/Networks-Learning/counterfactual-prediction-sets.

7/17/2024

Human Expertise in Algorithmic Prediction

Rohan Alur, Manish Raghavan, Devavrat Shah

We introduce a novel framework for incorporating human expertise into algorithmic predictions. Our approach focuses on the use of human judgment to distinguish inputs which "look the same" to any feasible predictive algorithm. We argue that this framing clarifies the problem of human/AI collaboration in prediction tasks, as experts often have access to information -- particularly subjective information -- which is not encoded in the algorithm's training data. We use this insight to develop a set of principled algorithms for selectively incorporating human feedback only when it improves the performance of any feasible predictor. We find empirically that although algorithms often outperform their human counterparts on average, human judgment can significantly improve algorithmic predictions on specific instances (which can be identified ex-ante). In an X-ray classification task, we find that this subset constitutes nearly 30% of the patient population. Our approach provides a natural way of uncovering this heterogeneity and thus enabling effective human-AI collaboration.

5/24/2024