Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback

2402.03746

Published 6/18/2024 by Daechul Ahn, Yura Choi, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

🏅

Abstract

Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs). The previous approaches for VLMMs involved Supervised Fine-Tuning (SFT) with instruction-tuned datasets, integrating LLM with visual encoders, and adding additional learnable modules. Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tune data compared to text-only data. We present a novel alignment strategy that employs multimodal AI system to oversee itself called Reinforcement Learning from AI Feedback (RLAIF), providing self-preference feedback to refine itself and facilitating the alignment of video and text modalities. In specific, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback in order to enrich the understanding of video content. Demonstrating enhanced performance across diverse video benchmarks, our multimodal RLAIF approach, VLM-RLAIF, outperforms existing approaches, including the SFT model. We commit to open-sourcing our code, models, and datasets to foster further research in this area.

Create account to get full access

Overview

Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs).
Previous approaches for VLMMs involved Supervised Fine-Tuning with instruction-tuned datasets, integrating LLM with visual encoders, and adding additional learnable modules.
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tune data compared to text-only data.
The paper presents a novel alignment strategy called Reinforcement Learning from AI Feedback (RLAIF), where a multimodal AI system oversees itself to refine and align video and text modalities.

Plain English Explanation

Large language models (LLMs) have made significant advancements in processing and generating text. Researchers have now started exploring ways to combine LLMs with visual information, creating video large multimodal models (VLMMs). Previous attempts to develop VLMMs involved fine-tuning the models on specialized datasets or adding additional components to integrate visual and textual information.

However, aligning video and text data has proven to be a challenging task, as there is often a lack of high-quality multimodal training data compared to text-only data. To address this, the researchers propose a novel approach called Reinforcement Learning from AI Feedback (RLAIF). In this method, the VLMM system provides feedback to itself, allowing it to learn and refine its understanding of the relationship between video and text. By using detailed video descriptions as context during the feedback process, the system can better comprehend the video content and improve the alignment between the two modalities.

The researchers demonstrate that their VLM-RLAIF approach outperforms existing methods, including the traditional fine-tuning approach, on a variety of video benchmarks. This suggests that the RLAIF technique is a promising way to address the challenges of aligning video and text data, which could lead to more advanced multimodal AI systems capable of understanding and interacting with both visual and textual information.

Technical Explanation

The paper presents a novel approach called Reinforcement Learning from AI Feedback (RLAIF) to address the challenge of aligning video and text modalities in large multimodal models (VLMMs). Previous approaches, such as Supervised Fine-Tuning (SFT) with instruction-tuned datasets and integrating LLMs with visual encoders, have struggled to effectively leverage the limited multimodal data available compared to text-only data.

The key idea behind RLAIF is to have the VLMM system provide self-preference feedback to refine and align the video and text modalities. This is achieved by using detailed video descriptions as context to enrich the system's understanding of the video content during the feedback generation process. The authors call their RLAIF-based VLMM model VLM-RLAIF.

The researchers evaluate VLM-RLAIF on various video benchmarks and demonstrate that it outperforms existing approaches, including the SFT model. This suggests that the RLAIF technique is an effective way to address the video-text alignment challenge and can lead to more advanced multimodal AI systems.

Critical Analysis

The paper presents a promising approach to address the video-text alignment challenge in large multimodal models. The RLAIF technique, where the system provides self-preference feedback to refine its understanding of the relationship between video and text, is a novel and interesting idea.

One potential limitation of the research is the reliance on detailed video descriptions as context for the feedback generation. While this approach seems to improve the system's comprehension of video content, it may be challenging to obtain such detailed descriptions at scale, especially for real-world video data. Further investigation into more scalable methods of providing contextual information to the system may be warranted.

Additionally, the paper does not provide a thorough analysis of the limitations or failure cases of the VLM-RLAIF approach. It would be valuable to understand the specific scenarios where the model struggles or produces suboptimal results, as this could inform future improvements and guide the development of more robust multimodal AI systems.

Despite these potential areas for further research, the paper's demonstration of VLM-RLAIF's superior performance on video benchmarks compared to existing approaches is a promising step forward in the field of video-language multimodal modeling. The researchers' commitment to open-sourcing their code, models, and datasets is also commendable, as it will foster further exploration and advancement in this important research area.

Conclusion

The paper presents a novel Reinforcement Learning from AI Feedback (RLAIF) approach to address the video-text alignment challenge in large multimodal models (VLMMs). By having the VLMM system provide self-preference feedback, the researchers demonstrate that their VLM-RLAIF model outperforms existing methods, including Supervised Fine-Tuning (SFT), on various video benchmarks.

This research represents an important step forward in the development of advanced multimodal AI systems that can effectively integrate and align visual and textual information. The RLAIF approach offers a promising solution to the longstanding challenge of video-text alignment, which could have significant implications for a wide range of applications, from multimodal content understanding to intelligent assistants capable of seamlessly interacting with both video and text.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🏋️

Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks

Daniel Wen, Nafisa Hussain

Large language models (LLMs) and large visual language models (LVLMs) have been at the forefront of the artificial intelligence field, particularly for tasks like text generation, video captioning, and question-answering. Typically, it is more applicable to train these models on broader knowledge bases or datasets to increase generalizability, learn relationships between topics, and recognize patterns. Instead, we propose to provide instructional datasets specific to the task of each modality within a distinct domain and then fine-tune the parameters of the model using LORA. With our approach, we can eliminate all noise irrelevant to the given task while also ensuring that the model generates with enhanced precision. For this work, we use Video-LLaVA to generate recipes given cooking videos without transcripts. Video-LLaVA's multimodal architecture allows us to provide cooking images to its image encoder, cooking videos to its video encoder, and general cooking questions to its text encoder. Thus, we aim to remove all noise unrelated to cooking while improving our model's capabilities to generate specific ingredient lists and detailed instructions. As a result, our approach to fine-tuning Video-LLaVA leads to gains over the baseline Video-LLaVA by 2% on the YouCook2 dataset. While this may seem like a marginal increase, our model trains on an image instruction dataset 2.5% the size of Video-LLaVA's and a video instruction dataset 23.76% of Video-LLaVA's.

6/26/2024

cs.CV cs.AI

FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback

Liqiang Jing, Xinya Du

Large Vision-Language Models (LVLMs) have demonstrated proficiency in tackling a variety of visual-language tasks. However, current LVLMs suffer from misalignment between text and image modalities which causes three kinds of hallucination problems, i.e., object existence, object attribute, and object relationship. To tackle this issue, existing methods mainly utilize Reinforcement Learning (RL) to align modalities in LVLMs. However, they still suffer from three main limitations: (1) General feedback can not indicate the hallucination type contained in the response; (2) Sparse rewards only give the sequence-level reward for the whole response; and (3)Annotation cost is time-consuming and labor-intensive. To handle these limitations, we propose an innovative method to align modalities in LVLMs through Fine-Grained Artificial Intelligence Feedback (FGAIF), which mainly consists of three steps: AI-based Feedback Collection, Fine-grained Reward Model Training, and Reinforcement Learning with Fine-grained Reward. Specifically, We first utilize AI tools to predict the types of hallucination for each segment in the response and obtain a collection of fine-grained feedback. Then, based on the collected reward data, three specialized reward models are trained to produce dense rewards. Finally, a novel fine-grained feedback module is integrated into the Proximal Policy Optimization (PPO) algorithm. Extensive experiments are conducted on hallucination and general benchmarks, demonstrating the superior performance of our proposed method. Notably, compared with previous models trained with the RL-based aligning method, our proposed method is effective even with fewer parameters.

4/9/2024

cs.CV cs.CL

🏅

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, Zackory Erickson

Reward engineering has long been a challenge in Reinforcement Learning (RL) research, as it often requires extensive human effort and iterative processes of trial-and-error to design effective reward functions. In this paper, we propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks, using only a text description of the task goal and the agent's visual observations, by leveraging feedbacks from vision language foundation models (VLMs). The key to our approach is to query these models to give preferences over pairs of the agent's image observations based on the text description of the task goal, and then learn a reward function from the preference labels, rather than directly prompting these models to output a raw reward score, which can be noisy and inconsistent. We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains - including classic control, as well as manipulation of rigid, articulated, and deformable objects - without the need for human supervision, outperforming prior methods that use large pretrained models for reward generation under the same assumptions. Videos can be found on our project website: https://rlvlmf2024.github.io/

6/18/2024

cs.RO cs.AI cs.LG

i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment

Daechul Ahn, Yura Choi, San Kim, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

Aligning Video Large Multimodal Models (VLMMs) face challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) recently showed a significant improvement in language model alignment, particularly on reasoning tasks, self-aligned models applied to large video-language models often result in lengthy and irrelevant responses. To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, and call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating already generated content and preference in loop, i-SRT improves the alignment between textual and visual modalities, reduce verbosity, and enhances content relevance. Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior arts. We are committed to opensourcing our code, models, and datasets to encourage further investigation.

6/18/2024

cs.CV