To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering

2403.01924

Published 6/14/2024 by Giacomo Frisoni, Alessio Cocchieri, Alex Presepi, Gianluca Moro, Zaiqiao Meng

Abstract

Medical open-domain question answering demands substantial access to specialized knowledge. Recent efforts have sought to decouple knowledge from model parameters, counteracting architectural scaling and allowing for training on common low-resource hardware. The retrieve-then-read paradigm has become ubiquitous, with model predictions grounded on relevant knowledge pieces from external repositories such as PubMed, textbooks, and UMLS. An alternative path, still under-explored but made possible by the advent of domain-specific large language models, entails constructing artificial contexts through prompting. As a result, to generate or to retrieve is the modern equivalent of Hamlet's dilemma. This paper presents MedGENIE, the first generate-then-read framework for multiple-choice question answering in medicine. We conduct extensive experiments on MedQA-USMLE, MedMCQA, and MMLU, incorporating a practical perspective by assuming a maximum of 24GB VRAM. MedGENIE sets a new state-of-the-art in the open-book setting of each testbed, allowing a small-scale reader to outcompete zero-shot closed-book 175B baselines while using up to 706× fewer parameters. Our findings reveal that generated passages are more effective than retrieved ones in attaining higher accuracy.

Overview

This paper presents MedGENIE, a new framework for medical question answering that generates artificial contexts through prompting, rather than relying solely on retrieving information from external sources.
The authors compare MedGENIE's performance to state-of-the-art retrieval-based models on multiple medical question answering benchmarks, including MedQA-USMLE, MedMCQA, and MMLU.
The key finding is that generated passages are more effective than retrieved ones in achieving higher accuracy, allowing a smaller-scale reader model to outperform much larger, zero-shot closed-book baselines.

Plain English Explanation

Answering medical questions accurately requires deep, specialized knowledge. Recent efforts have focused on decoupling this knowledge from the model parameters, allowing for training on more common hardware. The prevalent approach has been to retrieve relevant information from external sources and then use that to generate answers.

However, this paper explores an alternative path: constructing artificial contexts through prompting. The authors present MedGENIE, a "generate-then-read" framework for medical multiple-choice questions. MedGENIE sets a new state-of-the-art on several medical question answering benchmarks, outperforming much larger, zero-shot closed-book models while using up to 706 times fewer parameters.

The key insight is that the generated passages are more effective than retrieved ones at helping the model arrive at the correct answers. This suggests that prompting can be a powerful technique for boosting the performance of medical question answering systems, especially when working with limited computational resources.

Technical Explanation

The paper presents MedGENIE, a "generate-then-read" framework for medical multiple-choice question answering. Rather than solely relying on retrieving information from external sources like PubMed or textbooks, MedGENIE generates artificial passages through prompting to provide the necessary context for answering questions.
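The generate-then-read flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `MCQuestion`, `build_prompt`, `generate_then_read`, `toy_generator`, and `toy_reader` are all hypothetical names, the prompt template is invented, and the generator and reader are toy stand-ins for a domain-specific language model and a trained reader.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MCQuestion:
    stem: str
    options: Dict[str, str]  # option letter -> option text


def build_prompt(q: MCQuestion) -> str:
    # Hypothetical prompt template; the paper's actual prompt is not reproduced here.
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(q.options.items()))
    return f"Write background knowledge for this question.\n{q.stem}\n{opts}"


def generate_then_read(
    q: MCQuestion,
    generator: Callable[[str], str],
    reader: Callable[[MCQuestion, List[str]], str],
    n_contexts: int = 2,
) -> str:
    # Step 1: generate artificial contexts by prompting (no retrieval involved).
    prompt = build_prompt(q)
    contexts = [generator(prompt) for _ in range(n_contexts)]
    # Step 2: the reader answers the question grounded on those contexts.
    return reader(q, contexts)


def toy_generator(prompt: str) -> str:
    # Stand-in for a domain-specific LLM; always returns one canned passage.
    return "Scurvy results from deficiency of vitamin C (ascorbic acid)."


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_reader(q: MCQuestion, contexts: List[str]) -> str:
    # Stand-in for a trained reader: picks the option sharing the most words
    # with the generated contexts.
    ctx = _tokens(" ".join(contexts))
    return max(sorted(q.options), key=lambda k: len(_tokens(q.options[k]) & ctx))


question = MCQuestion(
    stem="Deficiency of which vitamin causes scurvy?",
    options={"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
)
print(generate_then_read(question, toy_generator, toy_reader))  # prints "B"
```

Passing the generator and reader as callables keeps the sketch agnostic about which models fill those roles; a retrieve-then-read variant would only swap `generator` for a function that looks passages up in an external corpus.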

The authors conduct extensive experiments on three medical question answering benchmarks: MedQA-USMLE, MedMCQA, and MMLU. They compare MedGENIE's performance to state-of-the-art retrieval-based models under a practical constraint of at most 24GB of VRAM.
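To give a rough sense of what a 24GB VRAM budget implies, the sketch below estimates the memory needed just to hold model weights. It rests on several assumptions that are not from the paper: 16-bit weights (2 bytes per parameter), "24GB" read as 24 GiB, and a hypothetical 250M-parameter reader used purely for contrast. Activations, optimizer state, and caches are ignored, so real requirements are higher.

```python
VRAM_BUDGET_GIB = 24  # assumption: the 24GB budget is treated as 24 GiB


def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    # Memory for the weights alone; 2 bytes per parameter assumes fp16/bf16.
    return n_params * bytes_per_param / 2**30


def fits_budget(n_params: float, budget_gib: float = VRAM_BUDGET_GIB) -> bool:
    return weight_memory_gib(n_params) <= budget_gib


models = {
    "175B closed-book baseline": 175e9,
    "250M reader (hypothetical size)": 250e6,
}
for name, n_params in models.items():
    gib = weight_memory_gib(n_params)
    print(f"{name}: {gib:.2f} GiB of weights, fits in budget: {fits_budget(n_params)}")
```

Under these assumptions the weights of a 175B-parameter model alone need hundreds of GiB, far beyond a single 24GB card, while a model in the hundreds of millions of parameters fits with ample headroom.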

The results show that MedGENIE sets a new state-of-the-art in the open-book setting of each testbed. Importantly, MedGENIE's smaller-scale reader model is able to outperform much larger, zero-shot closed-book baselines, using up to 706 times fewer parameters. The key finding is that the generated passages are more effective than retrieved ones in helping the model achieve higher accuracy.
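The "up to 706 times fewer parameters" figure can be turned into an approximate reader size with simple arithmetic. The sketch below assumes the baseline has exactly 175 billion parameters and takes the 706× ratio at face value; the result is an implied order of magnitude, not a parameter count reported in this summary.

```python
BASELINE_PARAMS = 175e9   # 175B closed-book baseline, as stated in the abstract
REDUCTION_FACTOR = 706    # "up to 706x fewer parameters"


def implied_reader_params(baseline: float, factor: float) -> float:
    # If the reader uses `factor` times fewer parameters than the baseline,
    # its size is simply the baseline divided by that factor.
    return baseline / factor


reader_params = implied_reader_params(BASELINE_PARAMS, REDUCTION_FACTOR)
print(f"Implied reader size: about {reader_params / 1e6:.0f}M parameters")
```

In other words, the smallest reader implied by the ratio is on the order of a quarter of a billion parameters.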

Critical Analysis

The paper presents a compelling alternative to the predominant retrieve-then-read approach for medical question answering. By leveraging the capabilities of domain-specific large language models to construct artificial contexts through prompting, the authors demonstrate that generated passages can be more effective than retrieved ones.

However, the paper does not delve into the limitations or potential drawbacks of this approach. For example, it would be valuable to understand the extent to which the performance of MedGENIE is dependent on the quality and consistency of the generated passages, and how this might be affected by factors like prompt engineering or the underlying language model's biases.

Additionally, the authors' choice to focus on the open-book setting, while practical, raises questions about how MedGENIE would perform in more realistic, closed-book scenarios where access to external knowledge sources is limited. Further research could explore ways to seamlessly integrate the generated passages with the reader model, potentially bridging the gap between open-book and closed-book performance.

Overall, the paper presents a promising direction for medical question answering, but additional investigation is needed to fully understand the strengths, weaknesses, and broader implications of this generate-then-read approach.

Conclusion

This paper introduces MedGENIE, a novel framework for medical question answering that generates artificial contexts through prompting, rather than relying solely on retrieving information from external sources. The authors' experiments demonstrate that this approach can outperform state-of-the-art retrieval-based models on multiple medical question answering benchmarks, even with a smaller-scale reader model.

The key insight is that the generated passages are more effective than retrieved ones in helping the model arrive at the correct answers. This suggests that prompting techniques can be a powerful tool for boosting the performance of medical question answering systems, especially when working with limited computational resources.

While the paper presents a compelling alternative to the predominant retrieve-then-read approach, further research is needed to fully understand the limitations and potential of this generate-then-read framework. Exploring its robustness in more realistic, closed-book scenarios and investigating ways to seamlessly integrate the generated passages with the reader model could lead to important advancements in the field of medical question answering.

Related Papers

Retrieval Augmented Generation for Domain-specific Question Answering

Sanat Sharma, David Seunghyun Yoon, Franck Dernoncourt, Dewang Sultania, Karishma Bagga, Mengjiao Zhang, Trung Bui, Varun Kotte

Question answering (QA) has become an important application in the advanced development of large language models. General pre-trained large language models for question-answering are not trained to properly understand the knowledge or terminology for a specific domain, such as finance, healthcare, education, and customer service for a product. To better cater to domain-specific understanding, we build an in-house question-answering system for Adobe products. We propose a novel framework to compile a large question-answer database and develop the approach for retrieval-aware finetuning of a Large Language model. We showcase that fine-tuning the retriever leads to major improvements in the final generation. Our overall approach reduces hallucinations during generation while keeping in context the latest retrieval information for contextual grounding.

5/30/2024

cs.CL cs.AI cs.IR cs.LG

MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering

Yucheng Shi, Shaochen Xu, Tianze Yang, Zhengliang Liu, Tianming Liu, Xiang Li, Ninghao Liu

Large Language Models (LLMs), although powerful in general domains, often perform poorly on domain-specific tasks like medical question answering (QA). Moreover, they tend to function as black-boxes, making it challenging to modify their behavior. To address the problem, our study delves into retrieval augmented generation (RAG), aiming to improve LLM responses without the need for fine-tuning or retraining. Specifically, we propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the query prompt for LLMs. Focusing on medical QA using the MedQA-SMILE dataset, we evaluate the impact of different retrieval models and the number of facts provided to the LLM. Notably, our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%. This work underscores the potential of RAG to enhance LLM performance, offering a practical approach to mitigate the challenges of black-box LLMs.

7/1/2024

cs.CL cs.AI

Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models

Zhongzhen Huang, Kui Xue, Yongqi Fan, Linjie Mu, Ruoyu Liu, Tong Ruan, Shaoting Zhang, Xiaofan Zhang

Large-scale language models (LLMs) have achieved remarkable success across various language tasks but suffer from hallucinations and temporal misalignment. To mitigate these shortcomings, Retrieval-augmented generation (RAG) has been utilized to provide external knowledge to facilitate the answer generation. However, applying such models to the medical domain faces several challenges due to the lack of domain-specific knowledge and the intricacy of real-world scenarios. In this study, we explore LLMs with RAG framework for knowledge-intensive tasks in the medical field. To evaluate the capabilities of LLMs, we introduce MedicineQA, a multi-round dialogue benchmark that simulates the real-world medication consultation scenario and requires LLMs to answer with retrieved evidence from the medicine database. MedicineQA contains 300 multi-round question-answering pairs, each embedded within a detailed dialogue history, highlighting the challenge posed by this knowledge-intensive task to current LLMs. We further propose a new Distill-Retrieve-Read framework instead of the previous Retrieve-then-Read. Specifically, the distillation and retrieval process utilizes a tool calling mechanism to formulate search queries that emulate the keyword-based inquiries used by search engines. With experimental results, we show that our framework brings notable performance improvements and surpasses the previous counterparts in the evidence retrieval process in terms of evidence retrieval accuracy. This advancement sheds light on applying RAG to the medical domain.

4/30/2024

cs.CL

Enhancing Knowledge Retrieval with In-Context Learning and Semantic Search through Generative AI

Mohammed-Khalil Ghali, Abdelrahman Farrag, Daehan Won, Yu Jin

Retrieving and extracting knowledge from extensive research documents and large databases presents significant challenges for researchers, students, and professionals in today's information-rich era. Existing retrieval systems, which rely on general-purpose Large Language Models (LLMs), often fail to provide accurate responses to domain-specific inquiries. Additionally, the high cost of pretraining or fine-tuning LLMs for specific domains limits their widespread adoption. To address these limitations, we propose a novel methodology that combines the generative capabilities of LLMs with the fast and accurate retrieval capabilities of vector databases. This advanced retrieval system can efficiently handle both tabular and non-tabular data, understand natural language user queries, and retrieve relevant information without fine-tuning. The developed model, Generative Text Retrieval (GTR), is adaptable to both unstructured and structured data with minor refinement. GTR was evaluated on both manually annotated and public datasets, achieving over 90% accuracy and delivering truthful outputs in 87% of cases. Our model achieved state-of-the-art performance with a Rouge-L F1 score of 0.98 on the MSMARCO dataset. The refined model, Generative Tabular Text Retrieval (GTR-T), demonstrated its efficiency in large database querying, achieving an Execution Accuracy (EX) of 0.82 and an Exact-Set-Match (EM) accuracy of 0.60 on the Spider dataset, using an open-source LLM. These efforts leverage Generative AI and In-Context Learning to enhance human-text interaction and make advanced AI capabilities more accessible. By integrating robust retrieval systems with powerful LLMs, our approach aims to democratize access to sophisticated AI tools, improving the efficiency, accuracy, and scalability of AI-driven information retrieval and database querying.

6/17/2024

cs.IR