OLAPH: Improving Factuality in Biomedical Long-form Question Answering

Read original: arXiv:2405.12701 - Published 5/22/2024 by Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, Taewhoo Lee, Jaewoo Kang

Overview

The paper introduces MedLFQA, a benchmark dataset for evaluating the factuality of long-form medical question-answering by large language models (LLMs).
The authors propose OLAPH, a framework that iteratively trains LLMs to improve factuality by using sampling predictions and preference optimization.
The paper reports that a 7B-parameter LLM trained with the OLAPH framework can produce long answers comparable to medical experts' answers in terms of factuality.

Plain English Explanation

When patients have questions, it's crucial that the responses from AI language models are factually accurate. The authors of this paper recognized the need for an automated way to evaluate the factuality of these long-form responses in the medical domain.

To address this, they created MedLFQA, a dataset that can be used to assess the factuality of long-form answers generated by AI models. They also developed a new framework called OLAPH, which trains language models to produce more factual responses.

The OLAPH framework works by iteratively training the language model. First, the model generates a set of candidate responses, and the one with the highest factuality score is selected as the preferred response. Then, the model is trained to align its outputs with this preferred, more factual response.

Through this iterative process, the language model learns to generate long answers that are comparable to those provided by medical experts in terms of factuality. This is an important step in ensuring that AI-generated medical information is reliable and trustworthy.

Technical Explanation

The paper introduces MedLFQA, a benchmark dataset reconstructed from existing long-form question-answering datasets in the biomedical domain. This dataset is designed to facilitate the automatic evaluation of factuality in LLM-generated responses.

The authors also propose the OLAPH framework, a novel approach to improving the factuality of LLM outputs. OLAPH works by iteratively training the LLM to mitigate hallucinations, or the generation of factually incorrect information.

The process involves sampling multiple possible responses from the LLM and then selecting the highest-scoring response in terms of factuality. This preferred response is then used as a target for the LLM to align its outputs with, effectively training the model to generate more factual long-form answers.
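The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_factuality_score`, `build_preference_pair`, and `PreferencePair` are hypothetical names, the scorer is a simple stand-in (fraction of key statements found in the answer) rather than the paper's evaluation metrics, and pairing the best candidate with the lowest-scoring one as the rejected response is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    question: str
    chosen: str    # highest-scoring sampled answer (the preferred response)
    rejected: str  # lowest-scoring sampled answer (an assumption of this sketch)


def toy_factuality_score(answer: str, key_statements: List[str]) -> float:
    """Hypothetical scorer: fraction of key statements found verbatim in the answer."""
    if not key_statements:
        return 0.0
    text = answer.lower()
    hits = sum(1 for statement in key_statements if statement.lower() in text)
    return hits / len(key_statements)


def build_preference_pair(
    question: str,
    candidates: List[str],
    score_fn: Callable[[str], float],
) -> PreferencePair:
    """Rank sampled candidates by score and pair the best with the worst."""
    if len(candidates) < 2:
        raise ValueError("need at least two sampled candidates to form a pair")
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return PreferencePair(question=question, chosen=ranked[0], rejected=ranked[-1])


if __name__ == "__main__":
    question = "What are common symptoms of anemia?"
    key_statements = ["fatigue", "pale skin", "shortness of breath"]
    # In practice these would be sampled from the LLM; here they are hard-coded.
    sampled = [
        "Anemia often causes fatigue.",
        "Anemia can cause fatigue, pale skin, and shortness of breath.",
        "Anemia is always caused by a virus.",
    ]
    pair = build_preference_pair(
        question, sampled, lambda a: toy_factuality_score(a, key_statements)
    )
    print("chosen:  ", pair.chosen)
    print("rejected:", pair.rejected)
```

Pairs built this way could then be passed to a preference-optimization trainer, and the sample, score, and train loop repeated for each iteration. The summary does not specify the objective used, so that step is left out of the sketch.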

The paper demonstrates that LLMs trained using the OLAPH framework show significant improvements in factuality, even on evaluation metrics not used during training. The authors highlight that a 7B-parameter LLM trained with OLAPH can provide long answers that are comparable to those of medical experts in terms of factuality.

Critical Analysis

The paper makes a valuable contribution to long-form question answering with large language models, particularly in the medical domain. The MedLFQA dataset and the OLAPH framework are important steps towards ensuring the factuality of AI-generated medical information.

However, the paper does not address the potential biases or limitations that may be present in the underlying datasets used to construct MedLFQA. Additionally, the authors do not explore the scalability of the OLAPH framework to larger language models or its applicability to other domains beyond the medical field.

Further research could investigate the robustness of the OLAPH framework to different types of hallucinations, as well as its performance on factuality benchmarks and expert-curated datasets. Additionally, exploring the integration of OLAPH with FactCheck, a framework for evaluating the factual accuracy of language models, could provide valuable insights.

Conclusion

This paper presents a significant step forward in addressing the challenge of ensuring the factuality of long-form responses generated by large language models in the medical domain. The MedLFQA dataset and the OLAPH framework provide valuable tools for researchers and practitioners working to develop trustworthy AI-powered medical assistants.

The findings showcase the potential for LLMs to generate long answers that rival the factuality of medical experts, which could have important implications for improving patient outcomes and reducing the burden on healthcare professionals. As the use of AI in medical decision-making continues to grow, this research highlights the critical importance of prioritizing factuality and reliability in language model development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

OLAPH: Improving Factuality in Biomedical Long-form Question Answering

Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, Taewhoo Lee, Jaewoo Kang

In the medical domain, numerous scenarios necessitate the long-form generation ability of large language models (LLMs). Specifically, when addressing patients' questions, it is essential that the model's response conveys factual claims, highlighting the need for an automated method to evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain. We use MedLFQA to facilitate the automatic evaluations of factuality. We also propose OLAPH, a simple and novel framework that enables the improvement of factuality through automatic evaluations. The OLAPH framework iteratively trains LLMs to mitigate hallucinations using sampling predictions and preference optimization. In other words, we iteratively set the highest-scoring response as a preferred response derived from sampling predictions and train LLMs to align with the preferred response that improves factuality. We highlight that, even on evaluation metrics not used during training, LLMs trained with our OLAPH framework demonstrate significant performance improvement in factuality. Our findings reveal that a 7B LLM trained with our OLAPH framework can provide long answers comparable to the medical experts' answers in terms of factuality. We believe that our work could shed light on gauging the long-text generation ability of LLMs in the medical domain. Our code and datasets are available at https://github.com/dmis-lab/OLAPH.

5/22/2024

Long-form factuality in large language models

Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.

4/5/2024

Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering

Rachneet Sachdeva, Yixiao Song, Mohit Iyyer, Iryna Gurevych

Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, challenging their faithful evaluation. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 4.7k span-level error annotations for five different error types by expert annotators, along with preference judgments. Using our collected data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. We train an automatic feedback model on this dataset that predicts error spans with incomplete information and provides associated explanations. Finally, we propose a prompt-based approach, Error-informed refinement, that uses signals from the learned feedback model to refine generated answers, which we show reduces hallucination and improves answer quality. Furthermore, humans find answers generated by our approach comprehensive and highly prefer them (84%) over the baseline answers.

7/17/2024

Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and positive effects of model scaling, as larger models outperform smaller ones for all model families. However, the best performance from GPT-4 still represents a large gap with the upper-bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling model known and unknown knowledge, we find the degradation is attributed to exemplars that contradict a model's known knowledge, as well as the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial, and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.

4/26/2024