CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Read original: arXiv:2408.10433 - Published 8/21/2024 by Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

Overview

Introduces a new method called CLIP-DPO for fixing hallucinations in large vision-language models (LVLMs)
Leverages vision-language models like CLIP as a source of preference for improving the outputs of LVLMs
Focuses on mitigating hallucinations, where LVLMs generate content that is not supported by the input

Plain English Explanation

CLIP is a vision-language model trained to judge how well a piece of text matches an image. The researchers behind this paper propose using CLIP to fix a common problem with large vision-language models (LVLMs): hallucinations.

Hallucinations occur when an LVLM generates content that is not actually supported by the input it was given. For example, if you show an LVLM a picture of a dog and ask it to describe the image, it might say "The image shows a small, brown dog playing fetch with a ball." But upon closer inspection, there is no ball in the image. The LVLM has hallucinated the existence of a ball.

The key insight of this paper is that CLIP's understanding of the relationship between images and text can be used to identify and fix these hallucinations. By comparing the LVLM's output to CLIP's understanding of the image, the researchers can identify when the LVLM is generating content that isn't grounded in the actual input. They can then use this information to "steer" the LVLM towards more accurate and faithful outputs.

The researchers call their method CLIP-DPO: it pairs CLIP with Direct Preference Optimization (DPO), an existing technique for fine-tuning a model on pairs of preferred and rejected outputs. The basic idea is to use CLIP as a scorer that ranks responses aligned with the image above hallucinated ones. These rankings then supply the preferred and rejected examples used to fine-tune the LVLM, pushing it to generate more accurate and grounded outputs.
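
As a rough illustration of this scoring idea, here is a minimal Python sketch. It is not the authors' code. It assumes the image and caption embeddings have already been produced by a CLIP-style encoder; the three-dimensional vectors and the function name clip_style_score are made up for illustration. The score is plain cosine similarity, which is how CLIP compares an image with a piece of text.

```python
import math

def clip_style_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_image = math.sqrt(sum(a * a for a in image_emb))
    norm_text = math.sqrt(sum(b * b for b in text_emb))
    return dot / (norm_image * norm_text)

# Toy embeddings (hypothetical values; a real CLIP model outputs
# vectors with hundreds of dimensions).
image_emb = [0.9, 0.1, 0.0]  # stands in for a picture of a dog with no ball
captions = {
    "A small brown dog on the grass.": [0.8, 0.2, 0.1],
    "A small brown dog playing fetch with a ball.": [0.5, 0.1, 0.8],
}

# Score every caption against the image and keep the best match.
scores = {text: clip_style_score(image_emb, emb) for text, emb in captions.items()}
best_caption = max(scores, key=scores.get)
```

With these toy vectors, the caption that mentions the ball scores lower than the one that does not, so it would be ranked as the less preferred response.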

Technical Explanation

The key technical components of CLIP-DPO are:

CLIP-based Preference Function: The researchers use the CLIP model to define a preference function that scores the alignment between an LVLM's output and the input image. This preference function rewards outputs that are consistent with CLIP's understanding of the image-text relationship.
Direct Preference Optimization: The researchers then fine-tune the LVLM with Direct Preference Optimization (DPO), an existing preference-learning method, using positive and negative response pairs selected by the CLIP ranking and a rule-based filter. This encourages the LVLM to generate outputs that align more closely with CLIP's preferences and are therefore less prone to hallucination.
Evaluation on Vision-Language Tasks: According to the paper's abstract, the researchers apply CLIP-DPO to the MobileVLM-v2 family of models and to LLaVA-1.5. They report significant reductions in hallucination over the baseline models, better zero-shot classification performance, and overall preserved performance on standard LVLM benchmarks.
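
The pipeline described in the paper's abstract (sample several responses, rank them by CLIP image-text similarity, filter them, then train with DPO) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names build_preference_pair and dpo_loss, the min_margin filter, and the value beta=0.1 are hypothetical, and the paper's rule-based filtering is reduced here to a single score-gap check. The loss is the standard DPO objective, -log sigmoid(beta * ((log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l)))), computed on sequence log-probabilities of the chosen response y_w and the rejected response y_l.

```python
import math

def build_preference_pair(scored_responses, min_margin=0.05):
    """Pick (chosen, rejected) from a list of (response, clip_score) tuples.

    Returns None when there are fewer than two candidates or when the score
    gap is too small to trust the ranking. The margin check is a stand-in
    (an assumption) for the paper's rule-based filtering.
    """
    if len(scored_responses) < 2:
        return None
    ranked = sorted(scored_responses, key=lambda item: item[1], reverse=True)
    (chosen, top_score), (rejected, bottom_score) = ranked[0], ranked[-1]
    if top_score - bottom_score < min_margin:
        return None
    return chosen, rejected

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair of sequence log-probabilities."""
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(x)) written as softplus(-x) for numerical stability.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Hypothetical candidates with made-up CLIP scores.
candidates = [
    ("A dog playing fetch with a ball.", 0.54),
    ("A small brown dog on the grass.", 0.98),
    ("A dog outdoors.", 0.90),
]
pair = build_preference_pair(candidates)

# Made-up log-probabilities: first with the policy equal to the reference
# model, then after the policy has shifted toward the chosen response.
loss_at_start = dpo_loss(-12.0, -11.0, -12.0, -11.0)
loss_after = dpo_loss(-10.0, -14.0, -12.0, -11.0)
```

When the policy matches the reference model the loss equals log 2, and it falls as the policy assigns relatively more probability to the chosen response than to the rejected one.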

Critical Analysis

The key innovation of this work is the use of a vision-language model like CLIP as a source of preference to improve the outputs of LVLMs. This is a clever and well-motivated approach, as CLIP's understanding of the image-text relationship can provide valuable guidance for identifying and mitigating hallucinations.

That said, the CLIP-DPO approach has some potential limitations. For example, CLIP's similarity scores may not capture every aspect of the image-text relationship, and there could be cases where CLIP's preferences diverge from human preferences. Additionally, fine-tuning of this kind can in principle degrade a model's performance on other tasks, although the authors report that performance on standard LVLM benchmarks is overall preserved.

Further research could explore ways to address these limitations, such as incorporating additional sources of preference beyond just CLIP, or developing more sophisticated fine-tuning approaches that better preserve the LVLM's broader capabilities. Nonetheless, this paper represents an important step towards improving the reliability and faithfulness of large vision-language models.

Conclusion

This paper introduces CLIP-DPO, a method for mitigating hallucinations in large vision-language models (LVLMs) that uses the image-text similarity scores of models like CLIP as a source of preference. With CLIP supplying the preference signal, the researchers report significant reductions in hallucination for the MobileVLM-v2 and LLaVA-1.5 models while largely preserving performance on standard benchmarks. While the approach has some limitations, it represents a useful step in the ongoing effort to build more reliable and trustworthy large-scale vision-language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

Despite recent successes, LVLMs or Large Vision Language Models are prone to hallucinating details like objects and their properties or relations, limiting their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method does not rely on paid-for APIs, and does not require additional training data or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are ranked based on their CLIP image-text similarities, and then filtered using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. We applied CLIP-DPO fine-tuning to the MobileVLM-v2 family of models and to LLaVA-1.5, in all cases observing significant improvements in terms of hallucination reduction over baseline models. We also observe better performance for zero-shot classification, suggesting improved grounding capabilities, and verify that the original performance on standard LVLM benchmarks is overall preserved.

8/21/2024

Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

Ailin Deng, Zhirui Chen, Bryan Hooi

Large Vision-Language Models (LVLMs) are susceptible to object hallucinations, an issue in which their generated text contains non-existent objects, greatly limiting their reliability and practicality. Current approaches often rely on the model's token likelihoods or other internal information, instruction tuning on additional datasets, or incorporating complex external tools. We first perform empirical analysis on sentence-level LVLM hallucination, finding that CLIP similarity to the image acts as a stronger and more robust indicator of hallucination compared to token likelihoods. Motivated by this, we introduce our CLIP-Guided Decoding (CGD) approach, a straightforward but effective training-free approach to reduce object hallucination at decoding time. CGD uses CLIP to guide the model's decoding process by enhancing visual grounding of generated text with the image. Experiments demonstrate that CGD effectively mitigates object hallucination across multiple LVLM families while preserving the utility of text generation. Codes are available at https://github.com/d-ailin/CLIP-Guided-Decoding.

4/24/2024

Direct Preference Optimization for Suppressing Hallucinated Prior Exams in Radiology Report Generation

Oishi Banerjee, Hong-Yu Zhou, Subathra Adithan, Stephen Kwak, Kay Wu, Pranav Rajpurkar

Recent advances in generative vision-language models (VLMs) have exciting potential implications for AI in radiology, yet VLMs are also known to produce hallucinations, nonsensical text, and other unwanted behaviors that can waste clinicians' time and cause patient harm. Drawing on recent work on direct preference optimization (DPO), we propose a simple method for modifying the behavior of pretrained VLMs performing radiology report generation by suppressing unwanted types of generations. We apply our method to the prevention of hallucinations of prior exams, addressing a long-established problem behavior in models performing chest X-ray report generation. Across our experiments, we find that DPO fine-tuning achieves a 3.2-4.8x reduction in lines hallucinating prior exams while maintaining model performance on clinical accuracy metrics. Our work is, to the best of our knowledge, the first work to apply DPO to medical VLMs, providing a data- and compute- efficient way to suppress problem behaviors while maintaining overall clinical accuracy.

6/18/2024

Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein

Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many large vision-language models (LVLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (LVLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of LVLMs by a malicious third party providing manipulated images are no longer possible once one replaces the original CLIP model with our robust one. No retraining or fine-tuning of the down-stream LVLMs is required. The code and robust models are available at https://github.com/chs20/RobustVLM

6/6/2024