ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

2402.06118

Published 4/19/2024 by Siming Yan, Min Bai, Weifeng Chen, Xiong Zhou, Qixing Huang, Li Erran Li

Abstract

By combining natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce a novel framework, ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is efficiently achieved using much cheaper human evaluations instead of full supervision, as well as automated methods. We show the effectiveness of our approach through a variety of evaluation methods and benchmarks. Additionally, we plan to release our human annotations, comprising approximately 16,000 image and generated-text pairs with fine-grained evaluations, to contribute to related research in the community.

Overview

The paper introduces ViGoR, a framework for improving the visual grounding of large vision language models (LVLMs), that is, how faithfully their generated text reflects the input image.
ViGoR uses fine-grained reward modeling, built from inexpensive human evaluations and automated methods rather than full supervision.
The authors report significant improvements in visual grounding over pre-trained baselines across a variety of evaluation methods and benchmarks.

Plain English Explanation

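Large vision language models (LVLMs) combine the language understanding, generation abilities, and broad knowledge of large language models with image perception, and they show impressive visual reasoning. However, the text they generate is often poorly grounded in the image: it may mention objects that are not there (hallucination), leave out significant parts of the scene, or describe the wrong attributes of and relationships between objects.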

ViGoR aims to address this with a more fine-grained approach to training. Instead of giving a generated description a single overall score, it relies on fine-grained evaluations that indicate which parts of the text are well grounded in the image and which are not. These evaluations come from human annotators, who judge model-generated text rather than writing full reference descriptions (which makes annotation much cheaper), and from automated methods. Reward modeling built on this feedback then steers the LVLM toward descriptions that match what is actually in the image.

The authors report that ViGoR significantly improves visual grounding over pre-trained baselines, as measured by a variety of evaluation methods and benchmarks. This suggests that fine-grained reward modeling is an effective way to reduce hallucinations, omissions, and attribute or relationship errors in LVLM output.

Technical Explanation

The key technical contribution of the ViGoR paper is a fine-grained reward modeling approach for training LVLMs. Rather than relying on a single reward signal for an entire generated response, ViGoR uses rewards at a finer granularity, so that the training signal can single out specific grounding errors: hallucinated scene elements, missing parts of the scene, and incorrect attributes or relationships.
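To make the contrast between a single reward and fine-grained rewards concrete, here is a minimal sketch. It assumes, hypothetically, that each sentence of a generated description has already been given a grounding score between 0 and 1; the `SentenceJudgment` type, the scoring scale, and the centering rule are illustrative choices, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceJudgment:
    """One sentence of a generated description with a hypothetical grounding score."""
    text: str
    score: float  # assumed scale: 1.0 = supported by the image, 0.0 = not supported


def holistic_reward(judgments: List[SentenceJudgment]) -> float:
    """A single response-level reward: the mean of all sentence scores."""
    return sum(j.score for j in judgments) / len(judgments)


def fine_grained_rewards(judgments: List[SentenceJudgment]) -> List[float]:
    """One reward per sentence, rescaled from [0, 1] to [-1, 1]."""
    return [2.0 * j.score - 1.0 for j in judgments]


# Illustrative data: the second sentence describes an object that is not in the image.
description = [
    SentenceJudgment("A dog sits on a red couch.", 1.0),
    SentenceJudgment("A cat sleeps beside it.", 0.0),
    SentenceJudgment("Sunlight comes through a window.", 1.0),
]

print(holistic_reward(description))       # one number for the whole response
print(fine_grained_rewards(description))  # [1.0, -1.0, 1.0]
```

The single score only says the response is mostly right, while the per-sentence rewards point at the one sentence that is wrong, which is the kind of localized signal that fine-grained reward modeling is meant to provide.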

The reward signals come from two sources. The first is human evaluation: annotators assess text that the LVLM has generated for an image, which is much cheaper than the full supervision of writing reference descriptions from scratch. The second is automated methods, which supply additional reward signals without human labeling. The authors plan to release their human annotations, approximately 16,000 image and generated-text pairs with fine-grained evaluations.
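One simple automated signal for a specific visual element is whether an object named in the text is actually present in the image. The sketch below illustrates that idea only; it is not confirmed as the paper's method. It assumes the objects mentioned in the text and the labels found by some object detector are already available as plain strings, and the function names and reward values are hypothetical.

```python
from typing import Dict, Iterable, Set


def object_existence_rewards(
    mentioned_objects: Iterable[str],
    detected_objects: Set[str],
    hit_reward: float = 1.0,
    miss_penalty: float = -1.0,
) -> Dict[str, float]:
    """Reward each mentioned object that the (assumed) detector found, penalize the rest."""
    detected = {name.lower() for name in detected_objects}
    return {
        obj: (hit_reward if obj.lower() in detected else miss_penalty)
        for obj in mentioned_objects
    }


def hallucination_rate(rewards: Dict[str, float]) -> float:
    """Fraction of mentioned objects that received a negative reward."""
    if not rewards:
        return 0.0
    return sum(1 for r in rewards.values() if r < 0) / len(rewards)


# Illustrative inputs: "cat" is mentioned in the text but was not detected in the image.
rewards = object_existence_rewards(
    mentioned_objects=["dog", "couch", "cat"],
    detected_objects={"Dog", "couch", "window"},
)
print(rewards)  # {'dog': 1.0, 'couch': 1.0, 'cat': -1.0}
print(hallucination_rate(rewards))
```

A check like this only covers object existence. Errors in attributes and relationships, or significant scene elements that the text omits, would need other signals, such as the human evaluations described above.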

The authors evaluate ViGoR using a variety of evaluation methods and benchmarks. They report that it significantly improves visual grounding over the pre-trained baseline LVLMs, which supports the effectiveness of fine-grained reward modeling.

Critical Analysis

The ViGoR paper makes a compelling case for fine-grained reward modeling in training large vision language models. The authors evaluate their approach with a variety of methods and benchmarks and report clear gains over pre-trained baselines.

However, some potential limitations deserve attention. For example, although human evaluations are cheaper than full supervision, collecting fine-grained judgments for roughly 16,000 image-text pairs and training with them still carries annotation and compute costs, which could be an important consideration when applying the approach to new models or domains.

Additionally, it would be valuable to understand the trade-offs between improved visual grounding and the model's performance on other tasks, such as general language understanding, and whether the fine-grained training approach has any unintended side effects on the model's broader capabilities.

Overall, the ViGoR paper presents a promising direction for enhancing the visual grounding abilities of large vision-language models. However, further research is needed to fully understand the practical implications and potential limitations of this approach.

Conclusion

The ViGoR paper introduces a framework for improving the visual grounding of large vision language models. By using fine-grained reward modeling, built from inexpensive human evaluations and automated methods, the authors report significant improvements over pre-trained baselines on a variety of benchmarks.

This research suggests that the fine-grained approach to training vision-language models can be a powerful way to enhance their understanding and reasoning about the visual world. As large vision-language models continue to play an increasingly important role in various applications, such as image understanding, multimodal information retrieval, and human-robot interaction, the ViGoR method could have important practical implications for improving the real-world effectiveness of these models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu

Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilized uni-modal pre-trained models to transfer visual/linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (Hi LoRA) paradigm. The cross-modal bridge can address the inconsistency between visual features and those required for grounding, and establish a connection between multi-level visual and text features. Hi LoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase the significant grounding capabilities as well as promising energy efficiency advantages. The project page: https://github.com/linhuixiao/HiVG.

4/23/2024

cs.CV

Calibrated Self-Rewarding Vision Language Models

Yiyang Zhou, Zhiyuan Fan, Dongjie Cheng, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, Huaxiu Yao

Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches may not effectively reflect the target LVLM's preferences, making the curated preferences easily distinguishable. Our work addresses these challenges by proposing the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. In the reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results demonstrate that CSR enhances performance and reduces hallucinations across ten benchmarks and tasks, achieving substantial improvements over existing methods by 7.62%. Our empirical results are further supported by rigorous theoretical analysis, under mild assumptions, verifying the effectiveness of introducing visual constraints into the self-rewarding paradigm. Additionally, CSR shows compatibility with different vision-language models and the ability to incrementally improve performance through iterative fine-tuning. Our data and code are available at https://github.com/YiyangZhou/CSR.

6/3/2024

cs.LG cs.CL cs.CV

FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback

Liqiang Jing, Xinya Du

Large Vision-Language Models (LVLMs) have demonstrated proficiency in tackling a variety of visual-language tasks. However, current LVLMs suffer from misalignment between text and image modalities which causes three kinds of hallucination problems, i.e., object existence, object attribute, and object relationship. To tackle this issue, existing methods mainly utilize Reinforcement Learning (RL) to align modalities in LVLMs. However, they still suffer from three main limitations: (1) General feedback cannot indicate the hallucination type contained in the response; (2) Sparse rewards only give the sequence-level reward for the whole response; and (3) Annotation is time-consuming and labor-intensive. To handle these limitations, we propose an innovative method to align modalities in LVLMs through Fine-Grained Artificial Intelligence Feedback (FGAIF), which mainly consists of three steps: AI-based Feedback Collection, Fine-grained Reward Model Training, and Reinforcement Learning with Fine-grained Reward. Specifically, we first utilize AI tools to predict the types of hallucination for each segment in the response and obtain a collection of fine-grained feedback. Then, based on the collected reward data, three specialized reward models are trained to produce dense rewards. Finally, a novel fine-grained feedback module is integrated into the Proximal Policy Optimization (PPO) algorithm. Extensive experiments are conducted on hallucination and general benchmarks, demonstrating the superior performance of our proposed method. Notably, compared with previous models trained with the RL-based aligning method, our proposed method is effective even with fewer parameters.

4/9/2024

cs.CV cs.CL

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, Zackory Erickson

Reward engineering has long been a challenge in Reinforcement Learning (RL) research, as it often requires extensive human effort and iterative processes of trial-and-error to design effective reward functions. In this paper, we propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks, using only a text description of the task goal and the agent's visual observations, by leveraging feedback from vision language foundation models (VLMs). The key to our approach is to query these models to give preferences over pairs of the agent's image observations based on the text description of the task goal, and then learn a reward function from the preference labels, rather than directly prompting these models to output a raw reward score, which can be noisy and inconsistent. We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains - including classic control, as well as manipulation of rigid, articulated, and deformable objects - without the need for human supervision, outperforming prior methods that use large pretrained models for reward generation under the same assumptions. Videos can be found on our project website: https://rlvlmf2024.github.io/

6/18/2024

cs.RO cs.AI cs.LG