Visual Hallucinations of Multi-modal Large Language Models

2402.14683

Published 6/18/2024 by Wen Huang, Hongbin Liu, Minxin Guo, Neil Zhenqiang Gong

Abstract

Visual hallucination (VH) means that a multi-modal LLM (MLLM) imagines incorrect details about an image in visual question answering. Existing studies find VH instances only in existing image datasets, which results in biased understanding of MLLMs' performance under VH due to limited diversity of such VH instances. In this work, we propose a tool called VHTest to generate a diverse set of VH instances. Specifically, VHTest finds some initial VH instances in existing image datasets (e.g., COCO), generates a text description for each VH mode, and uses a text-to-image generative model (e.g., DALL-E-3) to generate VH images based on the text descriptions. We collect a benchmark dataset with 1,200 VH instances in 8 VH modes using VHTest. We find that existing MLLMs such as GPT-4V, LLaVA-1.5, and MiniGPT-v2 hallucinate for a large fraction of the instances in our benchmark. Moreover, we find that fine-tuning an MLLM using our benchmark dataset reduces its likelihood to hallucinate without sacrificing its performance on other benchmarks. Our benchmarks are publicly available: https://github.com/wenhuang2000/VHTest.

Overview

This paper studies visual hallucination (VH) in multi-modal large language models (MLLMs): cases where a model imagines incorrect details about an image during visual question answering.
It introduces VHTest, a tool that generates a diverse set of VH instances, and uses it to build a benchmark of 1,200 VH instances in 8 VH modes.
The paper evaluates GPT-4V, LLaVA-1.5, and MiniGPT-v2 on this benchmark and reports that fine-tuning an MLLM on the benchmark reduces hallucination without sacrificing performance on other benchmarks.

Plain English Explanation

Visual hallucination (VH) happens when a multi-modal large language model (MLLM) is asked a question about an image and answers with details that are not actually in the image. This can be a significant issue, as these models are increasingly being used in real-world applications where accuracy and reliability are crucial.

Earlier studies looked for VH instances only in existing image datasets. The authors argue that this gives a biased picture of how MLLMs behave, because the instances found that way are not diverse enough. Their tool, VHTest, starts from a few VH instances found in existing datasets (e.g., COCO), writes a text description for each VH mode, and then uses a text-to-image model (e.g., DALL-E-3) to generate new VH images from those descriptions.

On the resulting benchmark, existing MLLMs such as GPT-4V, LLaVA-1.5, and MiniGPT-v2 hallucinate for a large fraction of the instances. The authors also report that fine-tuning an MLLM on the benchmark makes it less likely to hallucinate without sacrificing its performance on other benchmarks.

Technical Explanation

The paper defines visual hallucination (VH) as an MLLM imagining incorrect details about an image in visual question answering, and it organizes VH instances into 8 VH modes. The authors note that existing studies find VH instances only in existing image datasets, and argue that the limited diversity of those instances leads to a biased understanding of MLLMs' performance under VH.

To address this, the paper proposes VHTest, which generates VH instances in three steps. First, it finds some initial VH instances in existing image datasets (e.g., COCO). Second, it generates a text description for each VH mode. Third, it uses a text-to-image generative model (e.g., DALL-E-3) to generate VH images based on those text descriptions.
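The three VHTest steps listed in the abstract (find initial VH instances, describe each VH mode in text, generate images from the descriptions) can be pictured with a minimal sketch. This is not the authors' implementation: the `VHInstance` fields, the function `run_vhtest_sketch`, and the two callables `describe_mode` and `text_to_image` are hypothetical names introduced here, and the mode labels in the demo are placeholders. The abstract does not say how generated images are turned into complete question-and-answer instances, so that part is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class VHInstance:
    """One seed VH instance (all field names are hypothetical)."""
    image_ref: str   # path or URL of an image from an existing dataset
    question: str    # question that triggered the hallucination
    answer: str      # reference answer
    mode: str        # label of the VH mode this instance belongs to


def run_vhtest_sketch(
    seed_instances: Iterable[VHInstance],
    describe_mode: Callable[[str, List[VHInstance]], str],
    text_to_image: Callable[[str], str],
    images_per_mode: int = 2,
) -> Dict[str, List[str]]:
    """Return generated image references keyed by VH mode."""
    # Step 1: group the initial VH instances by their mode.
    by_mode: Dict[str, List[VHInstance]] = {}
    for inst in seed_instances:
        by_mode.setdefault(inst.mode, []).append(inst)

    generated: Dict[str, List[str]] = {}
    for mode, seeds in by_mode.items():
        # Step 2: produce one text description for this mode.
        description = describe_mode(mode, seeds)
        # Step 3: ask the text-to-image model for new images.
        generated[mode] = [text_to_image(description) for _ in range(images_per_mode)]
    return generated


if __name__ == "__main__":
    # Stub callables stand in for a real describer and image generator.
    seeds = [
        VHInstance("img1.jpg", "How many cups are there?", "3", "counting"),
        VHInstance("img2.jpg", "What color is the car?", "red", "color"),
    ]
    result = run_vhtest_sketch(
        seeds,
        describe_mode=lambda mode, s: f"description for {mode} from {len(s)} seed(s)",
        text_to_image=lambda desc: f"generated-image<{desc}>",
    )
    for mode, images in result.items():
        print(mode, images)
```

Passing the describer and the image generator as callables keeps the sketch independent of any particular model API; a real text-to-image client could be wrapped in a function with the same shape.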

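Using VHTest, the authors collect a benchmark dataset with 1,200 VH instances in 8 VH modes. They find that existing MLLMs such as GPT-4V, LLaVA-1.5, and MiniGPT-v2 hallucinate for a large fraction of these instances. They also find that fine-tuning an MLLM on the benchmark reduces its likelihood of hallucinating without sacrificing its performance on other benchmarks. The benchmark is publicly available at https://github.com/wenhuang2000/VHTest.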
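The abstract reports that models hallucinate "for a large fraction of the instances" in the benchmark. One simple way to compute such a fraction per VH mode is sketched below. The function name `hallucination_rate_by_mode` and the input format are assumptions made for illustration, and the sketch takes for granted that each model answer has already been judged as hallucinated or not (for example, by comparison with a reference answer); it does not reproduce the paper's evaluation protocol.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hallucination_rate_by_mode(
    results: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Map each VH mode to the fraction of its instances judged hallucinated.

    `results` holds (mode, hallucinated) pairs, one per benchmark instance.
    How `hallucinated` is decided is outside this sketch (assumption).
    """
    totals: Dict[str, int] = defaultdict(int)
    hallucinated_counts: Dict[str, int] = defaultdict(int)
    for mode, hallucinated in results:
        totals[mode] += 1
        if hallucinated:
            hallucinated_counts[mode] += 1
    # Only modes that appeared in `results` are reported, so no division by zero.
    return {mode: hallucinated_counts[mode] / totals[mode] for mode in totals}


if __name__ == "__main__":
    # Placeholder judgments, not results from the paper.
    demo = [("mode_a", True), ("mode_a", False), ("mode_b", True)]
    print(hallucination_rate_by_mode(demo))
```

Reporting the rate per mode rather than as one overall number makes it easier to see which kinds of visual detail a model gets wrong most often.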

Critical Analysis

VHTest broadens the pool of VH instances beyond what existing image datasets contain, but some questions remain open. The VH images are produced by a text-to-image generative model, so it is not clear from the abstract how closely they resemble the images MLLMs encounter in practice. The benchmark is also organized around 8 VH modes, and hallucinations that fall outside those modes may not be covered.

Additionally, the fine-tuning result is stated only at a high level in the abstract, so readers should consult the full paper to see which other benchmarks were used and how large the reduction in hallucination is. Further research is also needed to understand the underlying causes of VH in MLLMs.

Conclusion

This paper introduces VHTest, a tool for generating a diverse set of visual hallucination (VH) instances, along with a benchmark of 1,200 VH instances in 8 VH modes built with it. Existing MLLMs such as GPT-4V, LLaVA-1.5, and MiniGPT-v2 hallucinate for a large fraction of the instances in the benchmark.

The paper also shows that fine-tuning an MLLM on the benchmark reduces its likelihood of hallucinating without sacrificing its performance on other benchmarks. Because the benchmark is publicly available, it gives researchers a shared starting point for measuring and reducing VH in MLLMs.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Survey on Hallucination in Large Vision-Language Models

Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, Wei Peng

Recent development of Large Vision-Language Models (LVLMs) has attracted growing attention within the AI landscape for its practical implementation potential. However, "hallucination", or more specifically, the misalignment between factual visual content and corresponding textual generation, poses a significant challenge of utilizing LVLMs. In this comprehensive survey, we dissect LVLM-related hallucinations in an attempt to establish an overview and facilitate future mitigation. Our scrutiny starts with a clarification of the concept of hallucinations in LVLMs, presenting a variety of hallucination symptoms and highlighting the unique challenges inherent in LVLM hallucinations. Subsequently, we outline the benchmarks and methodologies tailored specifically for evaluating hallucinations unique to LVLMs. Additionally, we delve into an investigation of the root causes of these hallucinations, encompassing insights from the training data and model components. We also critically review existing methods for mitigating hallucinations. The open questions and future directions pertaining to hallucinations within LVLMs are discussed to conclude this survey.

5/7/2024

cs.CV cs.CL cs.LG

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, Mike Zheng Shou

This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications. This problem has attracted increasing attention, prompting efforts to detect and mitigate such inaccuracies. We review recent advances in identifying, evaluating, and mitigating these hallucinations, offering a detailed overview of the underlying causes, evaluation benchmarks, metrics, and strategies developed to address this issue. Additionally, we analyze the current challenges and limitations, formulating open questions that delineate potential pathways for future research. By drawing the granular classification and landscapes of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field. Through our thorough and in-depth review, we contribute to the ongoing dialogue on enhancing the robustness and reliability of MLLMs, providing valuable insights and resources for researchers and practitioners alike. Resources are available at: https://github.com/showlab/Awesome-MLLM-Hallucination.

4/30/2024

cs.CV

VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap

Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha

Recent interest in Large Vision-Language Models (LVLMs) for practical applications is moderated by the significant challenge of hallucination or the inconsistency between the factual information and the generated text. In this paper, we first perform an in-depth analysis of hallucinations and discover several novel insights about how and when LVLMs hallucinate. From our analysis, we show that: (1) The community's efforts have been primarily targeted towards reducing hallucinations related to visual recognition (VR) prompts (e.g., prompts that only require describing the image), thereby ignoring hallucinations for cognitive prompts (e.g., prompts that require additional skills like reasoning on contents of the image). (2) LVLMs lack visual perception, i.e., they can see but not necessarily understand or perceive the input image. We analyze responses to cognitive prompts and show that LVLMs hallucinate due to a perception gap: although LVLMs accurately recognize visual elements in the input image and possess sufficient cognitive skills, they struggle to respond accurately and hallucinate. To overcome this shortcoming, we propose Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method for alleviating hallucinations. Specifically, we first describe the image and add it as a prefix to the instruction. Next, during auto-regressive decoding, we sample from the plausible candidates according to their KL-Divergence (KLD) to the description, where lower KLD is given higher preference. Experimental results on several benchmarks and LVLMs show that VDGD improves significantly over other baselines in reducing hallucinations. We also propose VaLLu, a benchmark for the comprehensive evaluation of the cognitive capabilities of LVLMs.

5/27/2024

cs.CV cs.AI cs.CL

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Hao Jiang, Fei Wu, Linchao Zhu

The rapidly developing Large Vision Language Models (LVLMs) have shown notable capabilities on a range of multi-modal tasks, but still face the hallucination phenomena where the generated texts do not align with the given contexts, significantly restricting the usages of LVLMs. Most previous work detects and mitigates hallucination at the coarse-grained level or requires expensive annotation (e.g., labeling by proprietary models or human experts). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is that we generate a small-size sentence-level hallucination annotation dataset by proprietary models, whereby we train a hallucination detection model which can perform sentence-level hallucination detection, covering primary hallucination types (i.e., object, attribute, and relationship). Then, we propose a detect-then-rewrite pipeline to automatically construct preference dataset for training hallucination mitigating model. Furthermore, we propose differentiating the severity of hallucinations, and introducing a Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO) for mitigating hallucination in LVLMs by incorporating the severity of hallucinations into preference learning. Extensive experiments demonstrate the effectiveness of our method.

4/23/2024

cs.CV cs.AI cs.CL cs.LG