Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

Read original: arXiv:2409.13902 - Published 9/24/2024 by Aidan Gilson, Xuguang Ai, Thilaka Arunachalam, Ziyou Chen, Ki Xiong Cheong, Amisha Dave, Cameron Duic, Mercy Kibe, Annette Kaminaka, Minali Prasad and 12 others

Overview

Large Language Models (LLMs) have great potential in the medical field, but they can generate responses with unreliable or fabricated evidence.
Retrieval Augmented Generation (RAG) is a popular approach to address this issue, but few studies have evaluated its use in specific domains like ophthalmology.
This study developed a RAG pipeline with 70,000 ophthalmology documents and evaluated its performance on long-form consumer health questions, comparing LLMs with and without RAG.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text on a wide range of topics. In the medical field, LLMs could be useful for tasks like answering patient questions or summarizing research. However, LLMs can sometimes produce responses that are not based on real evidence, but instead make up or "hallucinate" information.

To address this issue, researchers have developed an approach called Retrieval Augmented Generation (RAG). RAG works by retrieving documents relevant to the question from a database and supplying them to the LLM, which then draws on those documents when generating its response. This helps ensure the response is grounded in real evidence rather than made-up information.
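The retrieve-then-generate idea can be sketched in a few lines. This is only a toy illustration, not the pipeline used in the study: the three-document corpus, the bag-of-words cosine scoring, and the names `retrieve` and `build_prompt` are all assumptions made for the example, and the final call to an LLM is left out.

```python
import math
import re
from collections import Counter

# Hypothetical three-document corpus; the study's database held 70,000 documents.
DOCS = {
    "doc1": "Glaucoma damages the optic nerve and is often linked to high eye pressure.",
    "doc2": "Cataract surgery replaces the cloudy lens with an artificial lens.",
    "doc3": "Dry eye can be treated with artificial tears and warm compresses.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, docs, k=2):
    """Return up to k document ids, most similar to the question first."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(question, docs, doc_ids):
    """Place the retrieved documents ahead of the question so the LLM can cite them."""
    context = "\n".join(f"[{d}] {docs[d]}" for d in doc_ids)
    return (
        "Answer using only the sources below and cite them by id.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


question = "How is dry eye treated?"
top = retrieve(question, DOCS)
prompt = build_prompt(question, DOCS, top)  # this string would be sent to an LLM
print(prompt)
```

A production system would typically swap the word-count scoring for a learned embedding model and a vector index, but the flow stays the same: score documents against the question, keep the best few, and put them in the prompt.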

In this study, the researchers developed a RAG system specifically for ophthalmology (eye care) by building a database of 70,000 relevant documents. They then evaluated how well this RAG system performed compared to a standard LLM on a set of 100 long-form consumer health questions about eye care.

The results showed that the standard LLM frequently cited made-up or incorrect evidence. When the LLM was combined with the RAG system, a much larger share of its references was correct and grounded in real evidence from the document database. The RAG system also helped the LLM attribute the evidence it was using more clearly.

However, the RAG system was not perfect: its answers were rated slightly lower on accuracy and completeness, and some of its references still contained errors. The researchers conclude that while RAG is a promising approach, there is still more work to be done to fully address the challenges of using LLMs in sensitive domains like healthcare.

Technical Explanation

This study developed and evaluated a Retrieval Augmented Generation (RAG) pipeline for using large language models (LLMs) in the ophthalmology domain. The researchers built a database of 70,000 ophthalmology-specific documents from which the pipeline retrieves relevant documents to augment the LLM at inference time.

They then conducted a case study evaluating LLM responses to 100 long-form consumer health questions about eye care, comparing performance with and without the RAG system. The responses, which included over 500 references, were assessed by 10 healthcare professionals on four key metrics:

Factuality of Evidence: Were the references and facts provided in the LLM responses accurate and grounded in real evidence?
Selection and Ranking of Evidence: How well did the RAG system identify and rank the most relevant supporting documents?
Attribution of Evidence: How clearly did the LLM responses attribute the evidence they were using?
Answer Accuracy and Completeness: How accurate and comprehensive were the final answers provided by the LLM?

The results showed that without RAG, the LLM responses included a high rate of "hallucinated" or fabricated evidence: of 252 references, 45.3% were hallucinated, 34.1% contained minor errors, and only 20.6% were correct. With RAG, the proportion of correct references rose to 54.5%, while 18.8% showed minor hallucinations and 26.7% contained errors.
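Percentages like these come from tallying a label assigned to each reference. The sketch below shows that tally under stated assumptions: the label names and the ten-reference example list are hypothetical, and only the two "correct" percentages at the end are taken from the reported results.

```python
from collections import Counter


def label_shares(labels):
    """Return the percentage of references carrying each label, to one decimal."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Hypothetical reviewer labels for ten references (not data from the study).
example = ["hallucinated"] * 4 + ["minor_error"] * 3 + ["correct"] * 3
shares = label_shares(example)
print(shares)

# Reported share of correct references, without and with RAG.
correct_without_rag = 20.6
correct_with_rag = 54.5
gain = round(correct_with_rag - correct_without_rag, 1)
print(f"Correct references rose by {gain} percentage points")
```

Reporting the change in percentage points (here 33.9) avoids the ambiguity of saying the share "more than doubled", which describes the ratio rather than the difference.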

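The RAG system was able to retrieve and rank relevant documents: 62.5% of the top 10 retrieved documents were selected as top references in the LLM response, with an average ranking of 4.9. RAG also improved evidence attribution, which rose from 1.85 to 2.49 on a 5-point scale (P<0.001). However, this improvement in evidence grounding came with slight decreases in answer accuracy (from 3.52 to 3.23, P=0.03) and completeness (from 3.47 to 3.27, P=0.17).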
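One plausible way to compute a selection rate and an average rank is sketched below. The source does not spell out the exact definition the authors used, so this is an assumed reading: for each question, count how many retrieved documents the response cited, and average the retrieval rank of those cited documents. The function name and the two-question example are hypothetical.

```python
def selection_stats(retrieved_by_question, cited_by_question):
    """Return (percent of retrieved documents cited, mean retrieval rank of cited ones)."""
    n_retrieved = 0
    cited_ranks = []
    for qid, ranked_docs in retrieved_by_question.items():
        cited = cited_by_question.get(qid, set())
        n_retrieved += len(ranked_docs)
        # Rank 1 is the document the retriever scored highest.
        for rank, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in cited:
                cited_ranks.append(rank)
    share = 100 * len(cited_ranks) / n_retrieved if n_retrieved else 0.0
    mean_rank = sum(cited_ranks) / len(cited_ranks) if cited_ranks else None
    return share, mean_rank


# Hypothetical example: two questions with four retrieved documents each.
retrieved = {"q1": ["a", "b", "c", "d"], "q2": ["e", "f", "g", "h"]}
cited = {"q1": {"a", "c"}, "q2": {"f"}}
share, mean_rank = selection_stats(retrieved, cited)
print(f"{share:.1f}% of retrieved documents were cited; mean rank {mean_rank:.1f}")
```

A mean rank well below the midpoint of the retrieved list would suggest the LLM favours the documents the retriever ranked highest; a mean near the midpoint would suggest it draws on the whole list.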

Overall, the study demonstrates that RAG can be an effective approach for enhancing LLM performance in domain-specific applications like healthcare, but there are still challenges to overcome in fully addressing the issue of evidence hallucination.

Critical Analysis

The researchers provide a thorough evaluation of their RAG pipeline, systematically assessing its impact on various aspects of LLM performance for a specific medical domain. This level of detailed, domain-specific analysis is important, as the performance of these systems can vary greatly depending on the application area.

One key strength of the study is the focus on factuality of evidence, which is a critical concern for using LLMs in sensitive domains like healthcare. The researchers' finding that standard LLMs frequently provide responses with fabricated or incorrect evidence is an important wake-up call, and highlights the need for approaches like RAG to address this issue.

However, the researchers also note that RAG did not completely solve the problem, with some decrease in overall answer accuracy and completeness. This suggests there is still room for improvement in how these retrieval-augmented systems balance evidence grounding with other aspects of response quality.

Additionally, the study was limited to a specific domain (ophthalmology) and a relatively small set of questions. Further research would be needed to assess how well these findings generalize to other medical specialties or broader types of queries.

Overall, this study provides valuable insights into the strengths and limitations of using RAG to enhance LLMs for healthcare applications. It underscores the importance of thorough, domain-specific evaluation and the ongoing challenges in developing reliable, evidence-based AI systems for sensitive domains.

Conclusion

This study demonstrates that large language models (LLMs) frequently generate responses with unreliable or fabricated evidence when applied to medical questions, raising significant concerns for their use in sensitive domains like healthcare. The researchers developed a Retrieval Augmented Generation (RAG) pipeline to address this issue, integrating a large database of ophthalmology-specific documents.

Evaluating the RAG system on long-form consumer health questions, the researchers found that it substantially reduced the proportion of hallucinated and erroneous evidence in LLM responses. However, the RAG system also encountered some challenges, leading to slight decreases in overall answer accuracy and completeness.

These findings highlight both the promise and limitations of current approaches for enhancing LLMs with retrieval-based mechanisms. While RAG represents an important step forward, there is still work to be done to develop AI systems that can reliably provide high-quality, evidence-based information in critical domains. Continued research and evaluation in specific application areas will be key to realizing the full potential of large language models in fields like healthcare.


Related Papers

Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

Aidan Gilson, Xuguang Ai, Thilaka Arunachalam, Ziyou Chen, Ki Xiong Cheong, Amisha Dave, Cameron Duic, Mercy Kibe, Annette Kaminaka, Minali Prasad, Fares Siddig, Maxwell Singer, Wendy Wong, Qiao Jin, Tiarnan D. L. Keenan, Xia Hu, Emily Y. Chew, Zhiyong Lu, Hua Xu, Ron A. Adelman, Yih-Chung Tham, Qingyu Chen

Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is popular to address this issue, few studies implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieve relevant documents to augment LLMs during inference time. In a case study on long-form consumer health questions, we systematically evaluated the responses including over 500 references of LLMs with and without RAG on 100 questions with 10 healthcare professionals. The evaluation focuses on factuality of evidence, selection and ranking of evidence, attribution of evidence, and answer accuracy and completeness. LLMs without RAG provided 252 references in total. Of which, 45.3% hallucinated, 34.1% consisted of minor errors, and 20.6% were correct. In contrast, LLMs with RAG significantly improved accuracy (54.5% being correct) and reduced error rates (18.8% with minor hallucinations and 26.7% with errors). 62.5% of the top 10 documents retrieved by RAG were selected as the top references in the LLM response, with an average ranking of 4.9. The use of RAG also improved evidence attribution (increasing from 1.85 to 2.49 on a 5-point scale, P<0.001), albeit with slight decreases in accuracy (from 3.52 to 3.23, P=0.03) and completeness (from 3.47 to 3.27, P=0.17). The results demonstrate that LLMs frequently exhibited hallucinated and erroneous evidence in the responses, raising concerns for downstream applications in the medical domain. RAG substantially reduced the proportion of such evidence but encountered challenges.

9/24/2024

Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models

Zhongzhen Huang, Kui Xue, Yongqi Fan, Linjie Mu, Ruoyu Liu, Tong Ruan, Shaoting Zhang, Xiaofan Zhang

Large-scale language models (LLMs) have achieved remarkable success across various language tasks but suffer from hallucinations and temporal misalignment. To mitigate these shortcomings, Retrieval-augmented generation (RAG) has been utilized to provide external knowledge to facilitate the answer generation. However, applying such models to the medical domain faces several challenges due to the lack of domain-specific knowledge and the intricacy of real-world scenarios. In this study, we explore LLMs with RAG framework for knowledge-intensive tasks in the medical field. To evaluate the capabilities of LLMs, we introduce MedicineQA, a multi-round dialogue benchmark that simulates the real-world medication consultation scenario and requires LLMs to answer with retrieved evidence from the medicine database. MedicineQA contains 300 multi-round question-answering pairs, each embedded within a detailed dialogue history, highlighting the challenge posed by this knowledge-intensive task to current LLMs. We further propose a new Distill-Retrieve-Read framework instead of the previous Retrieve-then-Read. Specifically, the distillation and retrieval process utilizes a tool calling mechanism to formulate search queries that emulate the keyword-based inquiries used by search engines. With experimental results, we show that our framework brings notable performance improvements and surpasses the previous counterparts in the evidence retrieval process in terms of evidence retrieval accuracy. This advancement sheds light on applying RAG to the medical domain.

4/30/2024

MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering

Yucheng Shi, Shaochen Xu, Tianze Yang, Zhengliang Liu, Tianming Liu, Quanzheng Li, Xiang Li, Ninghao Liu

Large Language Models (LLMs), although powerful in general domains, often perform poorly on domain-specific tasks such as medical question answering (QA). In addition, LLMs tend to function as black-boxes, making it challenging to modify their behavior. To address the problem, our work employs a transparent process of retrieval augmented generation (RAG), aiming to improve LLM responses without the need for fine-tuning or retraining. Specifically, we propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the LLM's query prompt. Focusing on medical QA, we evaluate the impact of different retrieval models and the number of facts on LLM performance using the MedQA-SMILE dataset. Notably, our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%. This work underscores the potential of RAG to enhance LLM performance, offering a practical approach to mitigate the challenges posed by black-box LLMs.

8/19/2024

A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models

Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, Qing Li

As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) can offer reliable and up-to-date external knowledge, providing huge convenience for numerous tasks. Particularly in the era of AI-Generated Content (AIGC), the powerful capacity of retrieval in providing additional knowledge enables RAG to assist existing generative AI in producing high-quality outputs. Recently, Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation, while still facing inherent limitations, such as hallucinations and out-of-date internal knowledge. Given the powerful abilities of RAG in providing the latest and helpful auxiliary information, Retrieval-Augmented Large Language Models (RA-LLMs) have emerged to harness external and authoritative knowledge bases, rather than solely relying on the model's internal knowledge, to augment the generation quality of LLMs. In this survey, we comprehensively review existing research studies in RA-LLMs, covering three primary technical perspectives: architectures, training strategies, and applications. As the preliminary knowledge, we briefly introduce the foundations and recent advances of LLMs. Then, to illustrate the practical significance of RAG for LLMs, we systematically review mainstream relevant work by their architectures, training strategies, and application areas, detailing specifically the challenges of each and the corresponding capabilities of RA-LLMs. Finally, to deliver deeper insights, we discuss current limitations and several promising directions for future research. Updated information about this survey can be found at https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/

6/18/2024