FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

Read original: arXiv:2402.11456 - Published 6/6/2024 by Sebastian Antony Joseph, Lily Chen, Jan Trienes, Hannah Louisa Goke, Monika Coers, Wei Xu, Byron C Wallace, Junyi Jessy Li

Overview

• This paper presents FactPICO, a benchmark for evaluating the factuality of plain language summaries of medical research papers describing randomized controlled trials (RCTs).

• RCTs are the foundation of evidence-based medicine and can directly inform patient treatment, so it's crucial that their key details are accurately captured in plain language summaries.

• The authors assess the factuality of critical elements of RCTs, such as the Populations, Interventions, Comparators, and Outcomes (PICO), as well as the reported findings, in summaries generated by three large language models (LLMs).

• They also evaluate the correctness of any additional explanatory information included in the summaries.
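
To make the evaluation dimensions above concrete, one expert annotation of a single summary could be represented roughly as follows. This is only an illustrative sketch: the class names, field names, boolean labels, and the sample rationale are assumptions invented for this example, not the data format released with FactPICO.

```python
from dataclasses import dataclass, field

# Hypothetical schema: names and label types are illustrative assumptions,
# not the actual FactPICO annotation format.
ELEMENTS = ("population", "intervention", "comparator", "outcome", "findings")


@dataclass
class ElementJudgment:
    element: str          # one of ELEMENTS
    is_factual: bool      # expert verdict for this element
    rationale: str = ""   # free-text explanation from the expert

    def __post_init__(self):
        if self.element not in ELEMENTS:
            raise ValueError(f"unknown element: {self.element}")


@dataclass
class SummaryAnnotation:
    model: str                                     # e.g. "GPT-4"
    judgments: list = field(default_factory=list)  # list of ElementJudgment
    added_info_correct: bool = True                # verdict on extra explanations

    def error_elements(self):
        """Return the elements the expert marked as not factual."""
        return [j.element for j in self.judgments if not j.is_factual]


# Toy annotation with an invented rationale, for illustration only.
annotation = SummaryAnnotation(
    model="GPT-4",
    judgments=[
        ElementJudgment("population", True),
        ElementJudgment("outcome", False, "Summary omits the follow-up period."),
    ],
)
print(annotation.error_elements())
```

Keeping one judgment per element, rather than a single score per summary, mirrors the fine-grained evaluation described above and makes it easy to see which part of a trial a summary got wrong.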

Plain English Explanation

• Summaries written in plain, easy-to-understand language can make technical medical research more accessible to the general public. However, it's important that these summaries accurately capture the key facts and findings.

• Randomized controlled trials (RCTs) are considered the gold standard for medical research, as they provide the strongest evidence for determining the effectiveness of treatments. The details of RCTs, like the types of patients involved, the treatments being compared, and the outcomes measured, are crucial for informing healthcare decisions.

• The authors created a benchmark, called FactPICO, to assess how well plain language summaries generated by large language models (GPT-4, Llama-2, and Alpaca) preserve the factual information in RCT abstracts. Experts evaluated each summary to check whether the key PICO elements and the overall findings were accurately represented.

• The researchers found that while plain language summaries can be helpful, it is still challenging for AI to balance simplicity and factual accuracy, especially for complex medical information. Existing automatic factuality metrics also correlate poorly with expert judgments of individual summaries.

Technical Explanation

• The authors constructed FactPICO, a dataset of 345 plain language summaries of RCT abstracts generated by three LLMs: GPT-4, Llama-2, and Alpaca.

• They recruited medical experts to provide fine-grained evaluations and natural language rationales on the factuality of the PICO elements and the reported findings in each summary.

• The researchers assessed how well a range of factuality metrics, including newly devised LLM-based ones, predict the expert judgments at the instance level, that is, for each individual summary.

• Their analysis revealed that existing factuality metrics correlate poorly with expert assessments of individual plain language summaries.

• The findings suggest that plain language summarization of medical evidence is still a challenging task, and that more research is needed to develop robust factuality evaluation approaches for this domain.
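
The instance-level comparison between a metric and the experts can be sketched as a rank correlation over per-summary scores. The example below is a minimal sketch, not the authors' evaluation code: it assumes Spearman correlation (this summary does not say which coefficient the paper uses), and the scores are invented toy values.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy data: one entry per summary.
expert_labels = [1, 0, 1, 1, 0, 0]               # 1 = judged factual by an expert
metric_scores = [0.9, 0.8, 0.4, 0.7, 0.6, 0.2]   # hypothetical automatic metric

print(f"instance-level Spearman: {spearman(expert_labels, metric_scores):.2f}")
```

A value near 1 would mean the metric ranks individual summaries much as the experts do; values near 0 correspond to the poor instance-level agreement described above.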

Critical Analysis

• The paper highlights the importance of accurately capturing the critical details of RCTs in plain language summaries, as these studies directly inform patient care. However, the authors acknowledge the inherent tension between simplicity and factuality when summarizing complex medical information.

• While the FactPICO benchmark provides a valuable resource for evaluating plain language summarization, the authors note that their dataset is relatively small, and the factuality judgments may be subjective to some degree.

• The poor correlation between existing factuality metrics and expert judgments suggests that more research is needed to develop robust and reliable ways to assess the factual accuracy of plain language summaries, especially in high-stakes domains like medicine.

• Future work could explore ways to better align plain language generation with the key PICO elements and findings, perhaps through more targeted training or specialized architectures. Techniques like synthetic data generation could also be investigated to expand the FactPICO dataset and improve the robustness of the benchmark.

Conclusion

• This paper presents an important benchmark, FactPICO, for evaluating the factual accuracy of plain language summaries of medical research, with a focus on randomized controlled trials.

• The findings suggest that while plain language summaries can improve accessibility, it remains challenging for AI language models to balance simplicity and factual correctness, especially for complex medical content.

• The poor performance of existing factuality metrics highlights the need for further research to develop more reliable and nuanced ways to assess the factual accuracy of plain language summaries, particularly in high-stakes domains like healthcare.

• Improving the factuality of plain language summaries could have significant implications for enhancing the accessibility and usability of medical research, ultimately benefiting patients and the general public.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

Sebastian Antony Joseph, Lily Chen, Jan Trienes, Hannah Louisa Goke, Monika Coers, Wei Xu, Byron C Wallace, Junyi Jessy Li

Plain language summarization with LLMs can be useful for improving textual accessibility of technical content. But how factual are these summaries in a high-stakes domain like medicine? This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts describing randomized controlled trials (RCTs), which are the basis of evidence-based medicine and can directly inform patient treatment. FactPICO consists of 345 plain language summaries of RCT abstracts generated from three LLMs (i.e., GPT-4, Llama-2, and Alpaca), with fine-grained evaluation and natural language rationales from experts. We assess the factuality of critical elements of RCTs in those summaries: Populations, Interventions, Comparators, Outcomes (PICO), as well as the reported findings concerning these. We also evaluate the correctness of the extra information (e.g., explanations) added by LLMs. Using FactPICO, we benchmark a range of existing factuality metrics, including the newly devised ones based on LLMs. We find that plain language summarization of medical evidence is still challenging, especially when balancing between simplicity and factuality, and that existing metrics correlate poorly with expert judgments on the instance level.

6/6/2024

FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction

Alessandro Scirè, Karim Ghonim, Roberto Navigli

Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.

9/4/2024

Measuring text summarization factuality using atomic facts entailment metrics in the context of retrieval augmented generation

N. E. Kriman

The use of large language models (LLMs) has significantly increased since the introduction of ChatGPT in 2022, demonstrating their value across various applications. However, a major challenge for enterprise and commercial adoption of LLMs is their tendency to generate inaccurate information, a phenomenon known as hallucination. This project proposes a method for estimating the factuality of a summary generated by LLMs when compared to a source text. Our approach utilizes Naive Bayes classification to assess the accuracy of the content produced.

8/28/2024

SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization

Prakamya Mishra, Zonghai Yao, Parth Vashisht, Feiyun Ouyang, Beining Wang, Vidhi Dhaval Mody, Hong Yu

Large Language Models (LLMs) such as GPT & Llama have demonstrated significant achievements in summarization tasks but struggle with factual inaccuracies, a critical issue in clinical NLP applications where errors could lead to serious consequences. To counter the high costs and limited availability of expert-annotated data for factual alignment, this study introduces an innovative pipeline that utilizes >100B parameter GPT variants like GPT-3.5 & GPT-4 to act as synthetic experts to generate high-quality synthetic feedback aimed at enhancing factual consistency in clinical note summarization. Our research primarily focuses on edit feedback generated by these synthetic feedback experts without additional human annotations, mirroring and optimizing the practical scenario in which medical professionals refine AI system outputs. Although such 100B+ parameter GPT variants have proven to demonstrate expertise in various clinical NLP tasks, such as the Medical Licensing Examination, there is scant research on their capacity to act as synthetic feedback experts and deliver expert-level edit feedback for improving the generation quality of weaker (<10B parameter) LLMs like GPT-2 (1.5B) & Llama 2 (7B) in the clinical domain. So in this work, we leverage 100B+ GPT variants to act as synthetic feedback experts offering expert-level edit feedback that is used to reduce hallucinations and align weaker (<10B parameter) LLMs with medical facts using two distinct alignment algorithms (DPO & SALT), endeavoring to narrow the divide between AI-generated content and factual accuracy. This highlights the substantial potential of LLM-based synthetic edits in enhancing the alignment of clinical factuality.

4/19/2024