From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives

Read original: arXiv:2404.02438 - Published 4/4/2024 by Shuxian Fan, Adam Visokay, Kentaro Hoffman, Stephen Salerno, Li Liu, Jeffrey T. Leek, Tyler H. McCormick

Overview

Verbal autopsies (VAs) are a common tool to monitor causes of death in places where most deaths occur outside the healthcare system.
VAs involve interviews with surviving caregivers or relatives to predict the deceased's cause of death.
Turning VA data into actionable insights requires two steps: (1) predicting likely causes of death from the VA interview, and (2) performing statistical analysis on the predicted causes.
This paper presents a new method called multiPPI++ that enables valid statistical inference using predicted causes of death from VA data.

Plain English Explanation

Knowing the common causes of death in a population is essential for public health officials to identify issues and allocate resources effectively. However, in many parts of the world, most deaths occur outside of hospitals or clinics, making it challenging to accurately track the causes.

Verbal autopsies provide a way to gather this information. They involve interviewing the family or caregiver of someone who has died and using the information they provide to estimate the person's likely cause of death. This data can then be analyzed to understand the overall trends in causes of death in that population.

The challenge is turning the raw VA interview data into meaningful insights that can inform public health decisions. The researchers in this paper developed a new statistical method called multiPPI++ to address this challenge.

multiPPI++ allows researchers to take the cause of death predictions from the VA interviews and then perform rigorous statistical analysis on them. This is important because the cause of death predictions may not be 100% accurate, and accounting for this uncertainty is crucial for making valid conclusions.

The key innovation of multiPPI++ is that it can handle uncertainty in the cause of death predictions, whether they come from a highly accurate prediction model or a less accurate one. This means researchers don't have to worry as much about having the "perfect" prediction model - multiPPI++ can work with whatever tool they have available.

Overall, this work helps make it easier for public health officials to leverage verbal autopsy data to understand the patterns and trends in causes of death in their communities. This information is vital for identifying priorities and allocating resources to save lives.

Technical Explanation

The core challenge addressed in this paper is how to perform valid statistical inference when the causes of death (COD) being analyzed are themselves predictions, generated from verbal autopsy (VA) interview text by state-of-the-art natural language processing (NLP) techniques.

The authors developed a new method called multiPPI++ that extends recent work on "prediction-powered inference" to the case of multinomial classification (i.e. predicting one of multiple possible causes of death). multiPPI++ leverages a suite of NLP techniques, including language models like GPT-4-32k, to generate COD predictions from free-form VA interview text.
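As a point of reference, the generic prediction-powered estimator for a vector of class proportions can be written as below. This is a sketch of the standard mean-estimation form from the prediction-powered inference literature, not the paper's exact multiPPI++ estimator (which targets multinomial models and is not reproduced in this summary). The notation is an assumption introduced here: $Y_i$ is the one-hot true cause for the $n$ labeled deaths, $f(\cdot)$ is the one-hot (or probability) output of the NLP classifier, and $\tilde{X}_j$ are the $N$ unlabeled narratives.

```latex
% Basic prediction-powered estimate: naive prediction average minus a rectifier
\hat{\theta}^{\mathrm{PP}}
  = \frac{1}{N}\sum_{j=1}^{N} f(\tilde{X}_j)
  - \frac{1}{n}\sum_{i=1}^{n} \bigl( f(X_i) - Y_i \bigr)

% PPI++-style variant with a tuning weight \lambda on the predictions
\hat{\theta}_{\lambda}
  = \frac{1}{n}\sum_{i=1}^{n} Y_i
  + \lambda \left( \frac{1}{N}\sum_{j=1}^{N} f(\tilde{X}_j)
  - \frac{1}{n}\sum_{i=1}^{n} f(X_i) \right)
```

Setting $\lambda = 1$ gives back the basic estimator, and $\lambda = 0$ ignores the predictions and uses only the labeled deaths.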

Crucially, multiPPI++ recovers estimates of the underlying COD distribution that match the ground truth whether the predictions come from a more accurate model (like GPT-4-32k) or a less accurate one (like a k-nearest neighbors classifier). The correction relies on a small set of deaths with high-quality cause labels, so researchers do not need a "perfect" prediction model, but they do need that labeled sample.
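The idea of correcting predicted cause shares with a small labeled sample can be illustrated with a toy simulation. The sketch below uses the basic prediction-powered mean correction, not the authors' multiPPI++ implementation. The cause labels, the true shares, and the `noisy_predictor` stand-in for an NLP classifier are all hypothetical.

```python
import random

CAUSES = ["cardiovascular", "infectious", "injury"]  # hypothetical cause labels
TRUE_SHARES = [0.5, 0.3, 0.2]                        # hypothetical population shares


def shares(labels, causes=CAUSES):
    """Fraction of labels falling in each cause."""
    n = len(labels)
    return [sum(1 for lab in labels if lab == c) / n for c in causes]


def ppi_shares(y_labeled, pred_labeled, pred_unlabeled, causes=CAUSES):
    """Naive predicted shares minus the prediction bias measured on labeled deaths."""
    naive = shares(pred_unlabeled, causes)
    pred_lab = shares(pred_labeled, causes)
    true_lab = shares(y_labeled, causes)
    return [nv - (pl - tl) for nv, pl, tl in zip(naive, pred_lab, true_lab)]


def noisy_predictor(true_label, rng, error_rate=0.3):
    """Stand-in classifier that outputs 'infectious' 30% of the time regardless of truth."""
    return "infectious" if rng.random() < error_rate else true_label


def draw(rng, k):
    """Sample k true causes from the hypothetical population."""
    return rng.choices(CAUSES, weights=TRUE_SHARES, k=k)


rng = random.Random(0)
y_lab = draw(rng, 2000)    # small sample with reference cause labels
y_unl = draw(rng, 20000)   # true causes that would be unobserved in practice
f_lab = [noisy_predictor(y, rng) for y in y_lab]
f_unl = [noisy_predictor(y, rng) for y in y_unl]

naive = shares(f_unl)
corrected = ppi_shares(y_lab, f_lab, f_unl)
print("naive    :", [round(v, 3) for v in naive])
print("corrected:", [round(v, 3) for v in corrected])
```

In this setup the naive shares inherit the classifier's tendency to over-predict one cause, while the corrected shares should land close to `TRUE_SHARES`, because the labeled sample measures that bias and subtracts it.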

The authors demonstrate the effectiveness of multiPPI++ through empirical analysis of real-world VA data. They show that the method is able to accurately estimate the breakdown of CODs by demographic factors, even when the underlying NLP predictions contain errors.

The key insight is that properly accounting for the prediction uncertainty through multiPPI++ is essential for drawing valid conclusions from the data. This has important implications for public health decision-making, as it suggests that even imperfect NLP tools can be leveraged effectively for this task as long as the statistical inference is done correctly.
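To make the point about uncertainty concrete, the sketch below compares two 95% intervals for the share of a single cause: one that treats predictions as if they were true labels, and one that adds the variance of the prediction errors measured on labeled deaths. It uses the standard normal-approximation interval for a prediction-powered mean, which is an assumption made here for illustration and not the paper's procedure. The simulated classifier, which misses 30% of true cases, is hypothetical.

```python
import random
from statistics import NormalDist, fmean, variance


def naive_ci(preds_unlabeled, alpha=0.05):
    """Interval that treats predictions as if they were true labels."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center = fmean(preds_unlabeled)
    se = (variance(preds_unlabeled) / len(preds_unlabeled)) ** 0.5
    return center - z * se, center + z * se


def ppi_ci(y_labeled, preds_labeled, preds_unlabeled, alpha=0.05):
    """Interval centered on the bias-corrected share, with both variance terms."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    errors = [f - y for f, y in zip(preds_labeled, y_labeled)]
    center = fmean(preds_unlabeled) - fmean(errors)
    se = (variance(preds_unlabeled) / len(preds_unlabeled)
          + variance(errors) / len(errors)) ** 0.5
    return center - z * se, center + z * se


TRUE_SHARE = 0.5  # hypothetical true share of the cause of interest
rng = random.Random(1)


def simulate(k):
    """Return (true indicators, predicted indicators) for k deaths."""
    y = [1.0 if rng.random() < TRUE_SHARE else 0.0 for _ in range(k)]
    f = [yi if rng.random() >= 0.3 else 0.0 for yi in y]  # misses 30% of true cases
    return y, f


y_lab, f_lab = simulate(2000)   # labeled deaths
_, f_unl = simulate(20000)      # unlabeled deaths: only predictions are kept

naive_lo, naive_hi = naive_ci(f_unl)
ppi_lo, ppi_hi = ppi_ci(y_lab, f_lab, f_unl)
print(f"naive CI    : ({naive_lo:.3f}, {naive_hi:.3f})")
print(f"corrected CI: ({ppi_lo:.3f}, {ppi_hi:.3f})")
```

The naive interval is narrow but centered on the biased prediction average, so it sits well below the true share. The corrected interval is wider because it also carries the uncertainty of the bias estimate from the labeled sample.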

Critical Analysis

The authors acknowledge several limitations and directions for future work:

The performance of multiPPI++ depends on the quality and representativeness of the labeled training data used to build the NLP models. If the training data does not capture the full diversity of causes of death in the population, the predictions may be biased.
The current implementation of multiPPI++ assumes independence between the multinomial COD predictions. In reality, there may be complex dependencies that are not captured by this assumption.
The paper focuses on verbal autopsy data, but the multiPPI++ approach could potentially be applied to other settings where outcomes are predicted from free-form text. Exploring these other use cases would be a valuable area for future research.

One additional concern is the reliance on language models like GPT-4-32k, which are proprietary and not accessible to many researchers. Having an open-source, high-performance COD prediction model would improve the practical applicability of the multiPPI++ approach.

Overall, this is a well-designed and impactful study that advances the state of the art in leveraging imperfect prediction models for public health research and decision-making. The authors' thoughtful consideration of the limitations and areas for further work strengthens the contribution.

Conclusion

This paper presents a novel statistical method called multiPPI++ that enables valid inference on cause of death (COD) predictions derived from verbal autopsy (VA) data using state-of-the-art natural language processing techniques.

The key innovation is that multiPPI++ recovers ground truth estimates of the COD distribution whether the predictions come from a more accurate or a less accurate NLP model. This means researchers can draw meaningful insights from VA data even with imperfect prediction models, provided a small amount of contextually relevant, high-quality labeled data is available.

By addressing the challenge of statistical inference on predicted outcomes, this work helps unlock the potential of verbal autopsies to inform public health decision-making in settings where most deaths occur outside the formal healthcare system. The insights generated can guide the allocation of resources and the design of interventions to reduce preventable mortality.

Overall, the multiPPI++ method represents an important step forward in bridging the gap between advanced AI/ML techniques and their real-world application to critical public health challenges.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives

Shuxian Fan, Adam Visokay, Kentaro Hoffman, Stephen Salerno, Li Liu, Jeffrey T. Leek, Tyler H. McCormick

In settings where most deaths occur outside the healthcare system, verbal autopsies (VAs) are a common tool to monitor trends in causes of death (COD). VAs are interviews with a surviving caregiver or relative that are used to predict the decedent's COD. Turning VAs into actionable insights for researchers and policymakers requires two steps: (i) predicting likely COD using the VA interview and (ii) performing inference with predicted CODs (e.g. modeling the breakdown of causes by demographic factors using a sample of deaths). In this paper, we develop a method for valid inference using outcomes (in our case COD) predicted from free-form text using state-of-the-art NLP techniques. This method, which we call multiPPI++, extends recent work in prediction-powered inference to multinomial classification. We leverage a suite of NLP techniques for COD prediction and, through empirical analysis of VA data, demonstrate the effectiveness of our approach in handling transportability issues. multiPPI++ recovers ground truth estimates, regardless of which NLP model produced predictions and regardless of whether they were produced by a more accurate predictor like GPT-4-32k or a less accurate predictor like KNN. Our findings demonstrate the practical importance of inference correction for public health decision-making and suggest that if inference tasks are the end goal, having a small amount of contextually relevant, high-quality labeled data is essential regardless of the NLP algorithm.

4/4/2024

Bayesian Prediction-Powered Inference

R. Alex Hofer, Joshua Maynez, Bhuwan Dhingra, Adam Fisch, Amir Globerson, William W. Cohen

Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. Specifically, PPI methods provide tighter confidence intervals by combining small amounts of human-labeled data with larger amounts of data labeled by a reasonably accurate, but potentially biased, automatic system. We propose a framework for PPI based on Bayesian inference that allows researchers to develop new task-appropriate PPI methods easily. Exploiting the ease with which we can design new metrics, we propose improved PPI methods for several important cases, such as autoraters that give discrete responses (e.g., prompted LLM "judges") and autoraters with scores that have a non-linear relationship to human scores.

5/13/2024

Coding historical causes of death data with Large Language Models

Bjørn Pedersen, Maisha Islam, Doris Tove Kristoffersen, Lars Ailo Bongo, Eilidh Garrett, Alice Reid, Hilde Sommerseth

This paper investigates the feasibility of using pre-trained generative Large Language Models (LLMs) to automate the assignment of ICD-10 codes to historical causes of death. Due to the complex narratives often found in historical causes of death, this task has traditionally been manually performed by coding experts. We evaluate the ability of GPT-3.5, GPT-4, and Llama 2 LLMs to accurately assign ICD-10 codes on the HiCaD dataset that contains causes of death recorded in the civil death register entries of 19,361 individuals from Ipswich, Kilmarnock, and the Isle of Skye from the UK between 1861-1901. Our findings show that GPT-3.5, GPT-4, and Llama 2 assign the correct code for 69%, 83%, and 40% of causes, respectively. However, we achieve a maximum accuracy of 89% by standard machine learning techniques. All LLMs performed better for causes of death that contained terms still in use today, compared to archaic terms. Also they perform better for short causes (1-2 words) compared to longer causes. LLMs therefore do not currently perform well enough for historical ICD-10 code assignment tasks. We suggest further fine-tuning or alternative frameworks to achieve adequate performance.

5/14/2024

Uncovering Misattributed Suicide Causes through Annotation Inconsistency Detection in Death Investigation Notes

Song Wang, Yiliang Zhou, Ziqiang Han, Cui Tao, Yunyu Xiao, Ying Ding, Joydeep Ghosh, Yifan Peng

Data accuracy is essential for scientific research and policy development. The National Violent Death Reporting System (NVDRS) data is widely used for discovering the patterns and causes of death. Recent studies have suggested annotation inconsistencies within the NVDRS and a potential impact on erroneous suicide-cause attributions. We present an empirical Natural Language Processing (NLP) approach to detect annotation inconsistencies and adopt a cross-validation-like paradigm to identify problematic instances. We analyzed 267,804 suicide death incidents between 2003 and 2020 from the NVDRS. Our results showed that incorporating the target state's data into training the suicide-crisis classifier brought an increase of 5.4% to the F-1 score on the target state's test set and a decrease of 1.1% on other states' test sets. To conclude, we demonstrated the annotation inconsistencies in NVDRS's death investigation notes, identified problematic instances, evaluated the effectiveness of correcting problematic instances, and eventually proposed an NLP improvement solution.

4/1/2024