Calibrated Self-Rewarding Vision Language Models

arXiv:2405.14622

Published 6/3/2024 by Yiyang Zhou, Zhiyuan Fan, Dongjie Cheng, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, Huaxiu Yao

cs.LG cs.CL cs.CV

Abstract

Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches may not effectively reflect the target LVLM's preferences, making the curated preferences easily distinguishable. Our work addresses these challenges by proposing the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. In the reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results demonstrate that CSR enhances performance and reduces hallucinations across ten benchmarks and tasks, achieving substantial improvements over existing methods by 7.62%. Our empirical results are further supported by rigorous theoretical analysis, under mild assumptions, verifying the effectiveness of introducing visual constraints into the self-rewarding paradigm. Additionally, CSR shows compatibility with different vision-language models and the ability to incrementally improve performance through iterative fine-tuning. Our data and code are available at https://github.com/YiyangZhou/CSR.

Overview

Large Vision-Language Models (LVLMs) have made significant progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
However, LVLMs often exhibit the "hallucination phenomenon," where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs.
Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment, but these approaches may not effectively reflect the target LVLM's preferences.
This work proposes the Calibrated Self-Rewarding (CSR) approach to address these challenges, enabling the model to self-improve through iterative response generation, evaluation, and preference data curation.

Plain English Explanation

Large vision-language models are powerful AI systems that can understand and generate text based on visual information. These models have made significant progress, but they sometimes produce text that doesn't match the input image, a problem known as the "hallucination phenomenon."

Existing methods to fix this issue use extra models or human-annotated data to teach the AI which text-image pairings are preferred. However, preferences curated this way may not reflect the target model's own preferences, which makes the curated preference pairs easy for the model to tell apart.

The Calibrated Self-Rewarding (CSR) approach proposed in this research allows the model to improve itself. The model generates candidate responses, evaluates the quality of each one, and then uses that self-evaluation to fine-tune and enhance its own performance. Importantly, the self-rewarding process places more emphasis on the visual input, which helps address the misalignment between images and text.

The researchers report that CSR improves performance and reduces hallucinations across ten benchmarks and tasks, outperforming existing methods by 7.62%. This work shows how AI models can learn to self-correct and better align their outputs with the information they're given, which is an important step in making these systems more reliable and trustworthy.

Technical Explanation

The paper presents the Calibrated Self-Rewarding (CSR) approach to address the hallucination phenomenon in Large Vision-Language Models (LVLMs). The hallucination problem arises because LVLMs tend to prioritize textual information over visual input, even when both language and visual representations are of high quality.

CSR enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. The reward modeling employs a step-wise strategy and incorporates visual constraints into the self-rewarding process to place greater emphasis on visual input.
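The generate, score, and curate steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Step` class, the two scores in [0, 1], the weight `lam`, and the choice to combine the scores as a weighted sum and add them up across sentences are all assumptions made for the example. The summary only says that rewards are computed step-wise and that a visual constraint is incorporated.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One generated sentence with two hypothetical scores in [0, 1]."""
    sentence: str
    text_score: float   # assumed: the model's own confidence in the sentence
    image_score: float  # assumed: image-sentence relevance (the visual constraint)


def step_reward(step: Step, lam: float) -> float:
    """Weighted mix of the visual score and the self-generated text score."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * step.image_score + (1.0 - lam) * step.text_score


def response_reward(steps: List[Step], lam: float) -> float:
    """Step-wise reward: accumulate the per-sentence rewards of one response."""
    return sum(step_reward(s, lam) for s in steps)


def curate_pair(
    candidates: List[List[Step]], lam: float = 0.9
) -> Tuple[List[Step], List[Step]]:
    """Return (chosen, rejected): the highest- and lowest-reward candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidate responses")
    ranked = sorted(candidates, key=lambda steps: response_reward(steps, lam))
    return ranked[-1], ranked[0]
```

With `lam` close to 1 the ranking is dominated by the visual score, so a fluent response that does not match the image can end up as the rejected example. With `lam = 0` the ranking falls back to the model's own text score alone. The resulting (chosen, rejected) pairs are the preference data that the fine-tuning step consumes.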

Empirical results show that CSR improves performance and reduces hallucinations across ten benchmarks and tasks, improving by 7.62% over existing methods that leverage additional models or human annotations to curate preference data. The authors' theoretical analysis, under mild assumptions, supports the effectiveness of introducing visual constraints into the self-rewarding paradigm.

Furthermore, CSR shows compatibility with different vision-language models and the ability to incrementally improve performance through iterative fine-tuning.
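The summary does not state which objective is used for the fine-tuning step. As a hedged sketch, assuming a DPO-style preference loss (an assumption, not confirmed by the text above), iteration t could train on the self-curated pairs as follows. Here x is the image together with the prompt, y_w and y_l are the chosen and rejected responses, D_t is the preference set curated at iteration t, π_{θ_t} is the model from the previous iteration used as the reference, and β is a hypothetical temperature hyperparameter.

```latex
\mathcal{L}_t(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}_t}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\theta_t}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\theta_t}(y_l \mid x)}
\right) \right]
```

Under this assumption, minimizing the loss gives the next model θ_{t+1}, which then generates and scores the next round of candidates. That repetition is how performance could improve incrementally over iterations.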

Critical Analysis

The paper presents a novel and promising approach to address the hallucination problem in LVLMs. The self-rewarding mechanism and incorporation of visual constraints are well-designed to better align the model's text generation with the input image.

However, the paper does not discuss potential limitations or broader implications of the CSR approach. For example, it would be valuable to understand how the model's performance and hallucination reduction scale with the size and complexity of the training data, or whether the self-rewarding process introduces any new biases or failure modes.

Additionally, the paper could have compared CSR to a more diverse set of existing methods for addressing hallucination, such as those that leverage multimodal attention or contrastive learning, to better contextualize the contributions of the proposed approach.

Overall, the CSR method represents an important step forward in improving the reliability and trustworthiness of LVLMs, but further research is needed to fully understand its limitations and potential broader applications.

Conclusion

The Calibrated Self-Rewarding (CSR) approach proposed in this paper addresses the hallucination problem in Large Vision-Language Models (LVLMs) by enabling the model to self-improve through iterative response generation, evaluation, and preference data curation. By incorporating visual constraints into the self-rewarding process, CSR better aligns the model's text generation with the input image, leading to substantial performance improvements and reductions in hallucinations across multiple benchmarks.

This work demonstrates the potential for AI systems to learn to self-correct and enhance their own capabilities, moving towards more reliable and trustworthy large-scale models that can effectively integrate visual and textual information. The findings in this paper have important implications for the development of robust and aligned vision-language models, which are critical for a wide range of applications, from image captioning to multimodal reasoning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

Siming Yan, Min Bai, Weifeng Chen, Xiong Zhou, Qixing Huang, Li Erran Li

By combining natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce a novel framework, ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is efficiently achieved using much cheaper human evaluations instead of full supervision, as well as automated methods. We show the effectiveness of our approach through a variety of evaluation methods and benchmarks. Additionally, we plan to release our human annotation comprising approximately 16,000 images and generated text pairs with fine-grained evaluations to contribute to related research in the community.

4/19/2024

cs.CV cs.AI

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Xiyao Wang, Jiuhai Chen, Zhaoyang Wang, Yuhang Zhou, Yiyang Zhou, Huaxiu Yao, Tianyi Zhou, Tom Goldstein, Parminder Bhatia, Furong Huang, Cao Xiao

Large vision-language models (LVLMs) have achieved impressive results in various visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there is still significant room for improvement in the alignment between visual and language modalities. Previous methods to enhance this alignment typically require external models or data, heavily depending on their capabilities and quality, which inevitably sets an upper bound on performance. In this paper, we propose SIMA, a framework that enhances visual and language modality alignment through self-improvement, eliminating the need for external models or data. SIMA leverages prompts from existing vision instruction tuning datasets to self-generate responses and employs an in-context self-critic mechanism to select response pairs for preference tuning. The key innovation is the introduction of three vision metrics during the in-context self-critic process, which can guide the LVLM in selecting responses that enhance image comprehension. Through experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA not only improves model performance across all benchmarks but also achieves superior modality alignment, outperforming previous approaches.

6/11/2024

cs.CV cs.AI cs.CL cs.LG

Self-Supervised Visual Preference Alignment

Ke Zhu, Liang Zhao, Zheng Ge, Xiangyu Zhang

This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization. It is based on a core idea: properly designed augmentation to the image input will induce the VLM to generate false but hard negative responses, which helps the model to learn from and produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT-4 or human involvement during alignment, and is highly efficient with few lines of code. With only 8k randomly sampled unsupervised data, it achieves 90% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7%/5.6% score on the complex multi-modal benchmark MM-Vet. Visualizations show its improved ability to align with user intentions. A series of ablations are conducted to reveal the latent mechanism of the approach, which also indicates its potential towards further scaling. Code will be available.

4/17/2024

cs.CV cs.AI cs.CL cs.LG

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, Manmohan Chandraker

Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision, we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task, treat the LLM as a policy, and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection, compositional visual question answering, and image-text retrieval, and show that in each case, the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: https://zaidkhan.me/ViReP

4/9/2024

cs.CV