Uncovering Misattributed Suicide Causes through Annotation Inconsistency Detection in Death Investigation Notes

Read original: arXiv:2403.19432 - Published 4/1/2024 by Song Wang, Yiliang Zhou, Ziqiang Han, Cui Tao, Yunyu Xiao, Ying Ding, Joydeep Ghosh, Yifan Peng

The National Violent Death Reporting System (NVDRS) is a comprehensive database gathering detailed information on violent fatalities, including suicide incidents, across the United States. It contains coded variables indicating the presence of various suicide-related social factors, annotated by human abstractors based on death investigation notes. However, only 5% of these annotations were verified by two independent annotators, raising concerns about potential inconsistencies due to individual annotator biases and human errors.

This study introduced an empirical Natural Language Processing (NLP) approach that uses transformer-based models to uncover annotation inconsistencies in the NVDRS death investigation notes. The researchers measured annotation discrepancies across U.S. states by comparing F1 scores on a target state's test set when the model was trained only on data from other states versus when the target state's data was included in training. A cross-validation-like framework was designed to identify problematic data instances contributing to these inconsistencies, which were then manually rectified and re-evaluated.
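The state-level comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `f1_score` and `f1_drop` and the toy labels and predictions are assumptions made up for the example, and the study itself obtains predictions from a transformer classifier.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_drop(y_true: Sequence[int],
            pred_with_target: Sequence[int],
            pred_without_target: Sequence[int]) -> float:
    """F1 lost on the target state's test set when the target state's
    training data is left out (hypothetical helper)."""
    return f1_score(y_true, pred_with_target) - f1_score(y_true, pred_without_target)


# Toy values, invented for illustration only (not NVDRS data).
target_labels = [1, 1, 1, 0, 0, 1]
pred_with_target = [1, 1, 1, 0, 0, 0]      # model trained with target-state data
pred_without_target = [1, 0, 1, 1, 0, 0]   # model trained on other states only

drop = f1_drop(target_labels, pred_with_target, pred_without_target)
print(f"F1 drop on target state: {drop:.3f}")
```

A larger drop suggests that the target state's labels follow conventions the other states' data does not capture, which is the signal the study reads as annotation inconsistency.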

The experiments demonstrated the effectiveness of the approach in identifying potential annotation errors in the NVDRS data. Correcting these errors yielded an average F1 score improvement of 3.85%. The study also analyzed odds ratios for various demographic subgroups to assess the risk of bias.

Overall, the work aimed to enhance the understanding of annotation inconsistencies in unstructured death investigation notes in the NVDRS, paving the way for more accurate and reliable utilization of this data in discovering suicide causes and developing prevention strategies at different levels.


The paper presents a method to identify and rectify annotation inconsistencies in unstructured death investigation notes for suicide causes. The authors demonstrate annotation inconsistencies across states by showing performance disparities when training crisis prediction systems on different combinations of state data.

They propose a method to identify problematic instances contributing to inconsistencies through cross-validation-like prediction error analysis. Removing these instances improves model performance and generalizability across states. Manual correction of identified potential mistakes in Ohio's data further enhances performance.

The authors examine the risk of bias by analyzing the relationship between suicide circumstances and demographics before and after removing identified mistakes. Differences in odds ratios suggest potential bias being uncovered.

The limitations noted include computational demands, the use of BERT models only, parameter tuning, bias identification, and the reliance on manual label correction rather than automated methods.

Overall, the study introduces an empirical NLP approach to uncover and resolve annotation inconsistencies in unstructured death investigation data, improving data quality and understanding of suicide causes.


The study utilizes data from the National Violent Death Reporting System (NVDRS) dataset, covering 267,804 recorded suicide death incidents across the United States from 2003 to 2020. Three illustrative examples are used: Family Relationship Crisis, Mental Health Crisis, and Physical Health Crisis. The data was preprocessed to address class imbalance.

The method aims to validate annotation inconsistencies and identify problematic instances in the dataset. It assumes that consistent annotations should be equivalently predictive across data subsets. Cross-validation is performed, comparing model predictions to ground truth labels and counting discrepancies. A thresholding mechanism flags potentially mistaken instances.
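The flagging step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function `flag_suspect_labels`, its parameters (`n_folds`, `n_rounds`, `threshold`), the repeated shuffling, and the toy keyword model are all hypothetical stand-ins. The study uses a transformer classifier, and its exact fold and threshold settings are not given in this summary.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

Predictor = Callable[[str], int]


def train_keyword_model(train_texts: Sequence[str],
                        train_labels: Sequence[int]) -> Predictor:
    """Toy stand-in for a real classifier: each word votes +1 or -1
    according to the labels of the training notes it appeared in."""
    scores: Counter = Counter()
    for text, label in zip(train_texts, train_labels):
        for word in text.split():
            scores[word] += 1 if label == 1 else -1

    def predict(text: str) -> int:
        return 1 if sum(scores[w] for w in text.split()) > 0 else 0

    return predict


def flag_suspect_labels(texts: Sequence[str], labels: Sequence[int],
                        train_fn: Callable[..., Predictor],
                        n_folds: int = 5, n_rounds: int = 3,
                        threshold: int = 2, seed: int = 0) -> List[int]:
    """Count how often a held-out prediction disagrees with the recorded
    label, and flag instances whose count reaches the threshold."""
    rng = random.Random(seed)
    disagreements: Counter = Counter()
    indices = list(range(len(texts)))
    for _ in range(n_rounds):
        rng.shuffle(indices)
        for k in range(n_folds):
            held_out = indices[k::n_folds]
            held_out_set = set(held_out)
            train_idx = [i for i in indices if i not in held_out_set]
            predict = train_fn([texts[i] for i in train_idx],
                               [labels[i] for i in train_idx])
            for i in held_out:
                if predict(texts[i]) != labels[i]:
                    disagreements[i] += 1
    return sorted(i for i, n in disagreements.items() if n >= threshold)


# Invented toy notes: index 5 describes a family conflict but is labeled 0.
texts = ["argument with spouse"] * 6 + ["no conflict reported"] * 6
labels = [1, 1, 1, 1, 1, 0] + [0] * 6

flagged = flag_suspect_labels(texts, labels, train_keyword_model)
print(flagged)
```

Flagged instances are candidates for manual review rather than confirmed errors; the threshold trades off how many instances reviewers must inspect against how many mistakes slip through.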

An incremental training paradigm is used to validate label consistency after correcting mistakes. Logistic regression models examine the relationship between suicide circumstances and demographic variables before and after removing identified mistakes.
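To make the before-and-after comparison concrete, the sketch below computes an unadjusted odds ratio from a 2x2 table with and without a set of flagged instances. It is a simplification, not the study's analysis: for a logistic regression with a single binary predictor, the exponentiated coefficient equals this cross-product ratio, but the authors' models may include additional variables. The function name `odds_ratio` and the counts are invented for illustration.

```python
from typing import AbstractSet, Sequence, Tuple

Record = Tuple[bool, bool]  # (in_demographic_subgroup, circumstance_coded_present)


def odds_ratio(records: Sequence[Record],
               exclude: AbstractSet[int] = frozenset()) -> float:
    """Unadjusted odds ratio of the circumstance for the subgroup versus
    everyone else, skipping any record whose index is in `exclude`."""
    a = b = c = d = 0
    for i, (in_group, has_circumstance) in enumerate(records):
        if i in exclude:
            continue
        if in_group and has_circumstance:
            a += 1
        elif in_group:
            b += 1
        elif has_circumstance:
            c += 1
        else:
            d += 1
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when an off-diagonal cell is empty")
    return (a * d) / (b * c)


# Invented counts, for illustration only (not NVDRS data).
records = ([(True, True)] * 30 + [(True, False)] * 70
           + [(False, True)] * 20 + [(False, False)] * 80)
flagged = set(range(10))  # pretend ten subgroup positives were flagged as mistakes

or_before = odds_ratio(records)
or_after = odds_ratio(records, exclude=flagged)
print(f"odds ratio before: {or_before:.2f}, after: {or_after:.2f}")
```

A noticeable shift between the two values would indicate that the flagged instances were concentrated in one subgroup, which is the kind of difference the study interprets as a potential bias.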

The BioBERT model is used for suicide crisis detection, framing it as a text classification task. Experiments are conducted on illustrative states (Ohio and Colorado) and crises.


This paper proposed an empirical natural language processing (NLP) approach to detect data annotation inconsistencies in the National Violent Death Reporting System (NVDRS) death investigation notes. Inconsistent annotations hamper the understanding of suicide causes and impede the development of effective suicide prevention strategies. The approach identified problematic instances causing inconsistencies and verified the effectiveness of correcting them. Experiment results showcased the capabilities and generalizability of the approach, while also highlighting its limitations. The authors intend to refine and expand their methodology to address annotation inconsistencies across diverse data sources. They advocate for establishing more stringent annotation guidelines and quality control measures to ensure consistent and reliable dataset annotations. Enhancing annotation accuracy and consistency within datasets can improve the performance and reliability of NLP models, ultimately supporting the discovery of true suicide causes and contributing to suicide prevention efforts.

Code Availability

The code for this work has been made publicly available on GitHub at https://github.com/bionlplab/2024_npjDM_Inconsistency_Detection.

Data Availability

The dataset analyzed in this study, NVDRS RAD, is available by request for eligible users due to its confidential nature. NVDRS contains sensitive information that could inadvertently reveal the identities of suspects and victims. To protect confidentiality and prevent unauthorized access, the CDC requires users to meet certain eligibility criteria and implement security measures. Researchers can apply for access to NVDRS by following the instructions provided.


This research was funded by the National Science Foundation (NSF) CAREER Award No. 2145640, an Amazon Research Award, and the National Institutes of Health Aim-Ahead Award No. OT2OD032581.

Author Contributions

S.W., Y.X., and Y.P. contributed to the study's conception and design and were responsible for acquiring the data. S.W., Y.Z., Y.X., and Y.P. analyzed and interpreted the data. Z.H., C.T., Y.X., Y.D., J.G., and Y.P. provided strategic guidance. S.W. and Y.P. contributed to paper organization and team logistics. S.W., Y.Z., Z.H., C.T., Y.X., Y.D., J.G., and Y.P. contributed to drafting and revising the manuscript.

Competing Interests

The authors state that they have no financial or non-financial competing interests related to the research.

Appendix A Supplementary materials

The supplementary materials explain how the crises and states were selected for the study. Physical Health, Family Relationship, and Mental Health crises were chosen because of their higher frequency and poor classification performance in previous work. Ohio and Colorado were selected as states for analysis because they had a higher frequency of positive instances and better classification scores compared to other states.

For model training, Binary Cross Entropy Loss and the Adam optimizer were used. Models were trained for 30 epochs, and model selection was based on validation set performance. The framework was implemented using PyTorch, and experiments were conducted on an Intel Xeon 6226R 16-core processor and Nvidia RTX A6000 GPUs.
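For readers unfamiliar with the training setup, the sketch below shows what binary cross-entropy computes and how selection on validation performance picks a checkpoint. It is a plain-Python illustration, not the authors' PyTorch code: the helper names `binary_cross_entropy` and `select_best_epoch` and the validation scores are assumptions, and the choice of validation metric is not specified in this summary.

```python
import math
from typing import Sequence


def binary_cross_entropy(y_true: Sequence[int], y_prob: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between 0/1 labels and predicted
    probabilities of the positive class."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def select_best_epoch(val_scores: Sequence[float]) -> int:
    """Index of the epoch with the highest validation score
    (the earliest one if several tie)."""
    return max(range(len(val_scores)), key=lambda epoch: val_scores[epoch])


# Invented values for illustration.
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
best_epoch = select_best_epoch([0.60, 0.72, 0.70])
print(f"loss={loss:.4f}, best epoch index={best_epoch}")
```

In the reported setup, the equivalent of `val_scores` would hold one entry per epoch across the 30 training epochs, and the checkpoint from the best-scoring epoch would be kept.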

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Uncovering Misattributed Suicide Causes through Annotation Inconsistency Detection in Death Investigation Notes

Song Wang, Yiliang Zhou, Ziqiang Han, Cui Tao, Yunyu Xiao, Ying Ding, Joydeep Ghosh, Yifan Peng

Data accuracy is essential for scientific research and policy development. The National Violent Death Reporting System (NVDRS) data is widely used for discovering the patterns and causes of death. Recent studies have suggested that annotation inconsistencies exist within the NVDRS and may lead to erroneous suicide-cause attributions. We present an empirical Natural Language Processing (NLP) approach to detect annotation inconsistencies and adopt a cross-validation-like paradigm to identify problematic instances. We analyzed 267,804 suicide death incidents between 2003 and 2020 from the NVDRS. Our results showed that incorporating the target state's data into training the suicide-crisis classifier brought an increase of 5.4% to the F1 score on the target state's test set and a decrease of 1.1% on other states' test sets. To conclude, we demonstrated the annotation inconsistencies in NVDRS's death investigation notes, identified problematic instances, evaluated the effectiveness of correcting problematic instances, and eventually proposed an NLP improvement solution.


From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives

Shuxian Fan, Adam Visokay, Kentaro Hoffman, Stephen Salerno, Li Liu, Jeffrey T. Leek, Tyler H. McCormick

In settings where most deaths occur outside the healthcare system, verbal autopsies (VAs) are a common tool to monitor trends in causes of death (COD). VAs are interviews with a surviving caregiver or relative that are used to predict the decedent's COD. Turning VAs into actionable insights for researchers and policymakers requires two steps: (i) predicting likely COD using the VA interview and (ii) performing inference with predicted CODs (e.g. modeling the breakdown of causes by demographic factors using a sample of deaths). In this paper, we develop a method for valid inference using outcomes (in our case COD) predicted from free-form text using state-of-the-art NLP techniques. This method, which we call multiPPI++, extends recent work in prediction-powered inference to multinomial classification. We leverage a suite of NLP techniques for COD prediction and, through empirical analysis of VA data, demonstrate the effectiveness of our approach in handling transportability issues. multiPPI++ recovers ground truth estimates, regardless of which NLP model produced predictions and regardless of whether they were produced by a more accurate predictor like GPT-4-32k or a less accurate predictor like KNN. Our findings demonstrate the practical importance of inference correction for public health decision-making and suggest that if inference tasks are the end goal, having a small amount of contextually relevant, high-quality labeled data is essential regardless of the NLP algorithm.


Supervised Learning and Large Language Model Benchmarks on Mental Health Datasets: Cognitive Distortions and Suicidal Risks in Chinese Social Media

Hongzhi Qi, Qing Zhao, Jianqiang Li, Changwei Song, Wei Zhai, Dan Luo, Shuo Liu, Yi Jing Yu, Fan Wang, Huijing Zou, Bing Xiang Yang, Guanghui Fu

On social media, users often express their personal feelings, which may exhibit cognitive distortions or even suicidal tendencies on certain specific topics. Early recognition of these signs is critical for effective psychological intervention. In this paper, we introduce two novel datasets from Chinese social media: SOS-HL-1K for suicidal risk classification and SocialCD-3K for cognitive distortions detection. The SOS-HL-1K dataset contains 1,249 posts, and the SocialCD-3K dataset is a multi-label classification dataset containing 3,407 posts. We propose a comprehensive evaluation using two supervised learning methods and eight large language models (LLMs) on the proposed datasets. From the prompt engineering perspective, we experimented with two types of prompt strategies, including four zero-shot and five few-shot strategies. We also evaluated the performance of the LLMs after fine-tuning on the proposed tasks. The experimental results show that there is still a huge gap between LLMs relying only on prompt engineering and supervised learning. In the suicide classification task, this gap is 6.95 percentage points in F1-score, while in the cognitive distortion task, the gap is even more pronounced, reaching 31.53 percentage points in F1-score. However, after fine-tuning, this difference is significantly reduced. In the suicide and cognitive distortion classification tasks, the gap decreases to 4.31% and 3.14%, respectively. This research highlights the potential of LLMs in psychological contexts, but supervised learning remains necessary for more challenging tasks. All datasets and code are made available.


Enhancing Suicide Risk Detection on Social Media through Semi-Supervised Deep Label Smoothing

Matthew Squires, Xiaohui Tao, Soman Elangovan, U Rajendra Acharya, Raj Gururajan, Haoran Xie, Xujuan Zhou

Suicide is a prominent issue in society. Unfortunately, many people at risk for suicide do not receive the support required. Barriers to people receiving support include social stigma and lack of access to mental health care. With the popularity of social media, people have turned to online forums, such as Reddit, to express their feelings and seek support. This provides the opportunity to support people with the aid of artificial intelligence. Social media posts can be classified, using text classification, to help connect people with professional help. However, these systems fail to account for the inherent uncertainty in classifying mental health conditions. Unlike other areas of healthcare, mental health conditions have no objective measurements of disease, often relying on expert opinion. Thus, when formulating deep learning problems involving mental health, using hard, binary labels does not accurately represent the true nature of the data. In these settings, where human experts may disagree, fuzzy or soft labels may be more appropriate. The current work introduces a novel label smoothing method which we use to capture any uncertainty within the data. We test our approach on a five-label multi-class classification problem. We show that our semi-supervised deep label smoothing method improves classification accuracy above the existing state of the art. Where existing research reports an accuracy of 43% on the Reddit C-SSRS dataset, using empirical experiments to evaluate our novel label smoothing method, we improve upon this existing benchmark to 52%. These improvements in model performance have the potential to better support those experiencing mental distress. Future work should explore the use of probabilistic methods in both natural language processing and quantifying contributions of both epistemic and aleatoric uncertainty in noisy datasets.

