Self-Supervised Visual Preference Alignment

2404.10501

Published 4/17/2024 by Ke Zhu, Liang Zhao, Zheng Ge, Xiangyu Zhang

Abstract

This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization. It is based on a core idea: properly designed augmentation to the image input will induce the VLM to generate false but hard negative responses, which helps the model to learn from and produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT-4 or human involvement during alignment, and is highly efficient with few lines of code. With only 8k randomly sampled unsupervised data, it achieves 90% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7%/5.6% score on complex multi-modal benchmark MM-Vet. Visualizations show its improved ability to align with user intentions. A series of ablations is conducted to reveal the latent mechanism of the approach, which also indicates its potential towards further scaling. Code will be available.

Overview

This paper makes a first attempt at unsupervised preference alignment for Vision-Language Models (VLMs).
The method has the VLM answer the same query for an original image and for an augmented copy, treats the two answers as the chosen and rejected responses, and aligns the model with direct preference optimization (DPO), without supervision from GPT-4 or human annotators.
With only 8k randomly sampled unsupervised examples, the authors report a score of 90% relative to GPT-4 on complex reasoning in LLaVA-Bench and gains of 6.7%/5.6% for LLaVA-7B/13B on MM-Vet.

Plain English Explanation

The paper explores a way to improve AI models that answer questions about images without needing human-labeled examples of good and bad answers. The key idea is to let the model produce both kinds of answer itself.

This is done in a "self-supervised" manner. The model first answers a question about the original image, and that answer is treated as the preferred one. It then answers the same question about a deliberately altered version of the image. Because the altered image is harder to read, the second answer tends to be plausible but wrong, and it serves as the rejected one. The model is then trained to favor the first kind of answer over the second.

The authors show that this approach improves LLaVA models on multi-modal benchmarks such as LLaVA-Bench and MM-Vet, and they report that the aligned model follows user intentions more closely.

The key benefit of this method is that it requires neither human annotators nor GPT-4 to judge answers. This makes it cheaper, more scalable, and more practical for real-world applications.

Technical Explanation

The paper proposes a self-supervised preference alignment pipeline for Vision-Language Models (VLMs). The core idea is that a properly designed augmentation of the image input induces the VLM to produce false but hard negative responses, which the model can then learn from.

Specifically, the VLM answers the same query twice: once with the original image and once with an augmented copy. The response to the original image is treated as the chosen response and the response to the augmented image as the rejected one, so preference pairs are built without GPT-4 or human annotation.
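The pair-construction step described in the abstract (chosen response from the original image, rejected response from an augmented copy) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `generate` stands in for any VLM inference call, `add_noise` is a hypothetical augmentation chosen only for simplicity, and skipping pairs with identical answers is an added assumption.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an image: a 2D grid of pixel intensities in [0, 1].
Image = List[List[float]]


def add_noise(image: Image, strength: float, rng: random.Random) -> Image:
    """Illustrative augmentation (an assumption, not necessarily the paper's):
    add uniform noise to every pixel and clamp the result to [0, 1]."""
    return [
        [min(1.0, max(0.0, px + rng.uniform(-strength, strength))) for px in row]
        for row in image
    ]


def build_preference_pairs(
    samples: List[Tuple[Image, str]],
    generate: Callable[[Image, str], str],
    augment: Callable[[Image], Image],
) -> List[Dict[str, object]]:
    """Build (chosen, rejected) pairs with no external labels.

    chosen   = model answer given the original image
    rejected = model answer given the augmented image
    """
    pairs: List[Dict[str, object]] = []
    for image, question in samples:
        chosen = generate(image, question)
        rejected = generate(augment(image), question)
        # Assumption: identical answers carry no preference signal, so skip them.
        if chosen == rejected:
            continue
        pairs.append(
            {
                "image": image,  # training later conditions on the original image
                "prompt": question,
                "chosen": chosen,
                "rejected": rejected,
            }
        )
    return pairs
```

In practice `generate` would wrap a real VLM and `augment` would be whichever image transformation is found to produce hard negatives; both are left as parameters here because the summary does not specify them.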

The model is then fine-tuned on these pairs with direct preference optimization (DPO), which raises the likelihood of the chosen response relative to the rejected one. The authors describe the whole pipeline as highly efficient and implementable in a few lines of code.
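For the alignment step, the abstract names direct preference optimization. The sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. How the log-probabilities are obtained (assumed here to be summed token log-probabilities of each response given the original image and question, under the trained policy and a frozen reference copy) and the value `beta=0.1` are assumptions, not details taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(log pi(y_c) - log pi_ref(y_c))
                                - (log pi(y_r) - log pi_ref(y_r))])
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, and rises when it favors the rejected one. Averaging this quantity over a batch of pairs gives the training objective.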

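The key technical contributions of the paper are the augmentation-based construction of preference pairs, an alignment pipeline that needs no external supervision, and a series of ablations that probe why the approach works. With only 8k randomly sampled unsupervised examples, the authors report a score of 90% relative to GPT-4 on complex reasoning in LLaVA-Bench and gains of 6.7%/5.6% for LLaVA-7B/13B on MM-Vet.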

Critical Analysis

The paper presents a compelling approach to preference alignment for VLMs without GPT-4 or human labeling. Because the model generates both the chosen and the rejected responses itself, alignment data is cheap to produce and the pipeline is simple to run.

One potential limitation is that the approach depends on how the augmentation is designed, a point the abstract itself stresses. If the augmentation is too weak, the rejected response may be nearly as good as the chosen one. If it is too strong, the negative may be too easy to provide a useful training signal. The chosen responses also come from the model itself, so any errors they contain are not corrected by an external signal.

It would also be useful to understand which kinds of errors the augmented images induce, and whether the resulting preference pairs reflect what users actually care about. The abstract mentions ablations that probe the underlying mechanism, which is a natural place to look for these answers.

Further research could investigate how well the approach transfers to other VLMs, larger datasets, and benchmarks beyond LLaVA-Bench and MM-Vet. The abstract suggests potential for further scaling, and testing that claim would be a valuable area for future work.

Overall, the paper presents a promising direction for aligning VLMs in a more scalable and practical way.

Conclusion

This paper introduces a self-supervised approach to preference alignment for Vision-Language Models. By contrasting the model's responses to original and augmented images, it builds preference pairs and aligns the model with direct preference optimization, without human labeling or GPT-4 supervision.

The authors demonstrate the approach on multi-modal benchmarks, reporting a score of 90% relative to GPT-4 on complex reasoning in LLaVA-Bench and gains of 6.7%/5.6% for LLaVA-7B/13B on MM-Vet with only 8k unsupervised examples. This suggests that self-generated preference data can improve VLMs at low cost.

Open questions remain about how sensitive the method is to the choice of augmentation and how well it generalizes to other models and benchmarks. Addressing these areas could lead to further advancements in this promising field of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Aligning Large Language Models with Self-generated Preference Data

Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim

Aligning large language models (LLMs) with human preferences becomes a key component to obtaining state-of-the-art performance, but it yields a huge cost to construct a large human-annotated preference dataset. To tackle this problem, we propose a new framework that boosts the alignment of LLMs through Self-generated Preference data (Selfie) using only a very small amount of human-annotated preference data. Our key idea is leveraging the human prior knowledge within the small (seed) data and progressively improving the alignment of LLM, by iteratively generating the responses and learning from them with the self-annotated preference data. To be specific, we propose to derive the preference label from the logits of LLM to explicitly extract the model's inherent preference. Compared to the previous approaches using external reward models or implicit in-context learning, we observe that the proposed approach is significantly more effective. In addition, we introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data. Our experimental results demonstrate that the proposed framework significantly boosts the alignment of LLMs. For example, we achieve superior alignment performance on AlpacaEval 2.0 with only 3.3% of the ground-truth preference labels in the Ultrafeedback data compared to the cases using the entire data or state-of-the-art baselines.

6/10/2024

cs.LG cs.AI cs.CL

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Xiyao Wang, Jiuhai Chen, Zhaoyang Wang, Yuhang Zhou, Yiyang Zhou, Huaxiu Yao, Tianyi Zhou, Tom Goldstein, Parminder Bhatia, Furong Huang, Cao Xiao

Large vision-language models (LVLMs) have achieved impressive results in various visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there is still significant room for improvement in the alignment between visual and language modalities. Previous methods to enhance this alignment typically require external models or data, heavily depending on their capabilities and quality, which inevitably sets an upper bound on performance. In this paper, we propose SIMA, a framework that enhances visual and language modality alignment through self-improvement, eliminating the need for external models or data. SIMA leverages prompts from existing vision instruction tuning datasets to self-generate responses and employs an in-context self-critic mechanism to select response pairs for preference tuning. The key innovation is the introduction of three vision metrics during the in-context self-critic process, which can guide the LVLM in selecting responses that enhance image comprehension. Through experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA not only improves model performance across all benchmarks but also achieves superior modality alignment, outperforming previous approaches.

6/11/2024

cs.CV cs.AI cs.CL cs.LG

Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

Xun Wu, Shaohan Huang, Furu Wei

Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or suffer from a lack of diversity in preference dimensions, resulting in limited applicability for instruction tuning in open-source text-to-image generative models and hindering further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. We aggregate feedback from AI annotators across four aspects: prompt-following, aesthetic, fidelity, and harmlessness to construct VisionPrefer. To validate the effectiveness of VisionPrefer, we train a reward model VP-Score over VisionPrefer to guide the training of text-to-image generative models and the preference prediction accuracy of VP-Score is comparable to human annotators. Furthermore, we use two reinforcement learning methods to supervised fine-tune generative models to evaluate the performance of VisionPrefer, and extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetic, and generalizes better than previous human-preference metrics across various image distributions. Moreover, VisionPrefer indicates that the integration of AI-generated synthetic data as a supervisory signal is a promising avenue for achieving improved alignment with human preferences in vision generative models.

4/24/2024

cs.CV cs.MM

Calibrated Self-Rewarding Vision Language Models

Yiyang Zhou, Zhiyuan Fan, Dongjie Cheng, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, Huaxiu Yao

Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches may not effectively reflect the target LVLM's preferences, making the curated preferences easily distinguishable. Our work addresses these challenges by proposing the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. In the reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results demonstrate that CSR enhances performance and reduces hallucinations across ten benchmarks and tasks, achieving substantial improvements over existing methods by 7.62%. Our empirical results are further supported by rigorous theoretical analysis, under mild assumptions, verifying the effectiveness of introducing visual constraints into the self-rewarding paradigm. Additionally, CSR shows compatibility with different vision-language models and the ability to incrementally improve performance through iterative fine-tuning. Our data and code are available at https://github.com/YiyangZhou/CSR.

6/3/2024

cs.LG cs.CL cs.CV