Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

Read original: arXiv:2308.04152 - Published 5/28/2024 by Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, Yueting Zhuang

Overview

Researchers have developed a module called Visual Prompt Generator Complete (VPG-C) to address a limitation of existing Multimodal Large Language Models (MLLMs): they struggle to follow demonstrative instructions.
VPG-C infers and completes the missing details needed to understand demonstrative instructions, which consist of multiple, interleaved, multimodal instructions that demonstrate the context required to complete a task.
The researchers propose a synthetic discriminative training strategy to fine-tune VPG-C, eliminating the need for supervised demonstrative instructions.
They also introduce DEMON, a new benchmark for evaluating demonstrative instruction understanding.

Plain English Explanation

Multimodal Large Language Models (MLLMs) are AI systems that can understand and generate language while also processing visual information. To enable this, researchers have been using Visual Prompt Generators (VPGs) to convert visual features into tokens that the language models can recognize.
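To make the idea of "converting visual features into tokens" concrete, here is a minimal sketch of one common VPG design: a small set of learnable queries that attend over image patch features and are then projected into the language model's embedding size. This design, the dimensions, and every name below (`visual_prompt_generator`, `patches`, `queries`, `projection`) are assumptions for illustration; the summary does not say which VPG architecture the paper uses.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def visual_prompt_generator(patch_features, queries, projection):
    """Turn image patch features into a fixed number of LLM-sized tokens.

    patch_features: list of patch vectors (dimension d_vision)
    queries:        hypothetical learnable query vectors (dimension d_vision)
    projection:     d_llm x d_vision matrix into the LLM embedding space
    """
    scale = math.sqrt(len(queries[0]))
    tokens = []
    for query in queries:
        # Attention weights of this query over every image patch.
        weights = softmax([dot(query, patch) / scale for patch in patch_features])
        # Attention-weighted average of the patch features.
        pooled = [
            sum(w * patch[i] for w, patch in zip(weights, patch_features))
            for i in range(len(query))
        ]
        # Linear projection into the (assumed) LLM embedding size.
        tokens.append([dot(row, pooled) for row in projection])
    return tokens


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
D_VISION, D_LLM, N_PATCHES, N_QUERIES = 4, 6, 9, 2  # toy sizes, not from the paper
patches = random_matrix(N_PATCHES, D_VISION)
queries = random_matrix(N_QUERIES, D_VISION)
projection = random_matrix(D_LLM, D_VISION)

visual_tokens = visual_prompt_generator(patches, queries, projection)
print(len(visual_tokens), len(visual_tokens[0]))  # 2 6
```

However many patches the image has, the output is always one token per query, which is what lets a language model consume an image as a short, fixed-length prefix.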

However, training VPGs on image-caption pairs has a limitation: the VPGs tend to focus only on the primary visual content needed to generate captions and often miss other important visual details. This becomes a problem for demonstrative instructions, which interleave several images and pieces of text and require understanding the full visual context.

To address this, the researchers have developed a new module called VPG-C (Visual Prompt Generator Complete). VPG-C can infer and fill in the missing visual details that are essential for comprehending these demonstrative instructions. They've also proposed a new training strategy that uses synthetic data, rather than needing real-world demonstrative instructions, to fine-tune VPG-C.

Additionally, the researchers have created a new benchmark called DEMON to evaluate how well systems can understand demonstrative instructions. On DEMON, the synthetically trained VPG-C achieves significantly stronger zero-shot performance across all tasks, and further evaluation on the MME and OwlEval benchmarks also favors VPG-C.

Technical Explanation

The researchers have developed a new module called Visual Prompt Generator Complete (VPG-C) to address the limitations of existing Multimodal Large Language Models (MLLMs) in understanding demonstrative instructions.

Demonstrative instructions consist of multiple, interleaved, multimodal instructions that demonstrate the context required to complete a task. The image-captioning-based training of VPGs biases them toward the primary visual contents, often neglecting other important visual details. As a result, MLLMs underperform when comprehending these complex demonstrative instructions.
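As an illustration of what "interleaved and multimodal" means in practice, the following sketch represents a demonstrative instruction as an ordered list of text and image segments and flattens it into a single sequence in which each image is replaced by placeholder tokens. The segment classes, the placeholder format, and the file names are hypothetical; they are not taken from the paper or from the DEMON benchmark.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    path: str  # hypothetical file name, used only for illustration


Segment = Union[TextSegment, ImageSegment]


def flatten(segments: List[Segment], tokens_per_image: int = 2) -> List[str]:
    """Interleave text words with per-image placeholder tokens, preserving order."""
    sequence: List[str] = []
    image_index = 0
    for segment in segments:
        if isinstance(segment, ImageSegment):
            # Each image contributes a fixed number of visual-token slots.
            sequence.extend(
                f"<img{image_index}_{k}>" for k in range(tokens_per_image)
            )
            image_index += 1
        else:
            sequence.extend(segment.text.split())
    return sequence


instruction = [
    TextSegment("Compare"),
    ImageSegment("first.png"),
    TextSegment("with"),
    ImageSegment("second.png"),
    TextSegment("and describe the difference."),
]
flat = flatten(instruction)
print(flat)
```

The point of the sketch is that the position of each image among the text matters: a model has to relate `<img0_*>` and `<img1_*>` to the words around them, which requires more than a caption-level view of each image.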

To address this issue, the researchers propose VPG-C, which can infer and complete the missing visual details essential for understanding demonstrative instructions. They also introduce a synthetic discriminative training strategy to fine-tune VPG-C, eliminating the need for supervised demonstrative instructions.
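The summary does not describe how VPG-C completes missing details or what the discriminative objective looks like, so the sketch below is only one plausible reading, with every name hypothetical. It assumes (1) an instruction-dependent guidance vector re-attends over the image patches and the result is added back to the existing visual tokens as a residual, and (2) the discriminative objective can be expressed as a binary cross-entropy loss on a yes/no decision about a synthetic example.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def complete_visual_tokens(visual_tokens, patch_features, guidance):
    """Add guidance-selected patch detail back onto each visual token.

    Assumes tokens, patches, and guidance share one dimension (a simplification).
    """
    scale = math.sqrt(len(guidance))
    weights = softmax(
        [sum(g * p for g, p in zip(guidance, patch)) / scale for patch in patch_features]
    )
    # Detail vector: patches weighted by their relevance to the guidance.
    detail = [
        sum(w * patch[i] for w, patch in zip(weights, patch_features))
        for i in range(len(guidance))
    ]
    # Residual connection: the original tokens are kept and enriched.
    return [[t + d for t, d in zip(token, detail)] for token in visual_tokens]


def discriminative_loss(logit, label):
    """Binary cross-entropy on a raw logit, written in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Toy values chosen only to make the arithmetic easy to follow.
tokens = [[1.0, 1.0]]
patches = [[1.0, 0.0], [3.0, 0.0]]
neutral_guidance = [0.0, 0.0]  # attends to both patches equally

completed = complete_visual_tokens(tokens, patches, neutral_guidance)
print(completed)  # [[3.0, 1.0]]
print(discriminative_loss(0.0, 1))  # log(2), the loss of an undecided model
```

The residual form keeps the caption-trained tokens intact and only adds information, which is one way a "lightweight" add-on module could be attached to an existing VPG without retraining it from scratch.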

For evaluation, the researchers build DEMON, a comprehensive benchmark for demonstrative instruction understanding. The synthetically trained VPG-C achieves significantly stronger zero-shot performance across all tasks of DEMON, and further evaluation on the MME and OwlEval benchmarks also demonstrates its advantage.
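A claim of stronger performance "across all tasks" is usually checked by scoring each task separately rather than pooling all examples. The sketch below shows that bookkeeping for exact-match accuracy. The record format, the task names, and the toy data are invented for illustration; they are not DEMON tasks or results from the paper, and the benchmark's actual metrics may differ.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute exact-match accuracy per task.

    records: iterable of (task_name, model_prediction, reference_answer)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, answer in records:
        total[task] += 1
        # Case- and whitespace-insensitive exact match.
        correct[task] += int(prediction.strip().lower() == answer.strip().lower())
    return {task: correct[task] / total[task] for task in total}


def macro_average(scores):
    """Average of per-task scores, so every task counts equally."""
    return sum(scores.values()) / len(scores)


# Toy records, not real benchmark data.
toy_records = [
    ("task_a", "Yes", "yes"),
    ("task_a", "no", "yes"),
    ("task_b", " cat ", "cat"),
]
scores = per_task_accuracy(toy_records)
print(scores)                 # {'task_a': 0.5, 'task_b': 1.0}
print(macro_average(scores))  # 0.75
```

Reporting per-task scores alongside the macro average makes it visible when a model improves on most tasks but regresses on one, which a single pooled number would hide.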

The researchers also provide their benchmark, code, and pre-trained models in a public repository at https://github.com/DCDmllm/Cheetah.

Critical Analysis

The researchers have presented an innovative solution to address the limitations of existing Multimodal Large Language Models in understanding complex demonstrative instructions. By introducing VPG-C and the synthetic discriminative training strategy, they have demonstrated a way to improve the visual understanding capabilities of these models.

One potential limitation of the approach is that the synthetic training data, while effective, may not fully capture the nuances and complexities of real-world demonstrative instructions. It would be interesting to see how VPG-C performs when evaluated on a more diverse set of real-world demonstrative instructions, beyond the DEMON benchmark.

Additionally, the researchers mention that VPG-C is a "generic and lightweight" module, but it would be helpful to have more details on its computational requirements and how it integrates with existing MLLM architectures. This information could help researchers and practitioners assess the practical feasibility of deploying VPG-C in real-world applications.

Overall, the research represents an important step forward in improving the multimodal understanding capabilities of large language models, and the availability of the benchmark, code, and pre-trained models is a valuable contribution to the field.

Conclusion

This research has introduced a new module called Visual Prompt Generator Complete (VPG-C) to address the limitations of existing Multimodal Large Language Models (MLLMs) in understanding complex demonstrative instructions. VPG-C can infer and complete the missing visual details essential for comprehending these instructions, and the researchers have proposed a synthetic discriminative training strategy to fine-tune the module without the need for supervised demonstrative instructions.

The researchers have also created a new benchmark, DEMON, to evaluate how well systems understand demonstrative instructions. They show that VPG-C achieves significantly stronger zero-shot performance across all tasks in DEMON, and further evaluation on the MME and OwlEval benchmarks also favors VPG-C.

This work represents an important advancement in improving the multimodal understanding capabilities of large language models, and the availability of the benchmark, code, and pre-trained models will be valuable resources for the research community.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, Yueting Zhuang

Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can recognize. This is achieved by training the VPGs on millions of image-caption pairs, where the VPG-generated tokens of images are fed into a frozen LLM to generate the corresponding captions. However, this image-captioning based training objective inherently biases the VPG to concentrate solely on the primary visual contents sufficient for caption generation, often neglecting other visual details. This shortcoming results in MLLMs' underperformance in comprehending demonstrative instructions consisting of multiple, interleaved, and multimodal instructions that demonstrate the required context to complete a task. To address this issue, we introduce a generic and lightweight Visual Prompt Generator Complete module (VPG-C), which can infer and complete the missing details essential for comprehending demonstrative instructions. Further, we propose a synthetic discriminative training strategy to fine-tune VPG-C, eliminating the need for supervised demonstrative instructions. As for evaluation, we build DEMON, a comprehensive benchmark for demonstrative instruction understanding. Synthetically trained with the proposed strategy, VPG-C achieves significantly stronger zero-shot performance across all tasks of DEMON. Further evaluation on the MME and OwlEval benchmarks also demonstrates the superiority of VPG-C. Our benchmark, code, and pre-trained models are available at https://github.com/DCDmllm/Cheetah.

5/28/2024

Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment

Wenliang Zhong, Wenyi Wu, Qi Li, Rob Barton, Boxin Du, Shioulin Sam, Karim Bouyarmane, Ismail Tutar, Junzhou Huang

Multimodal Large Language Models (MLLMs) have achieved SOTA performance in various visual language tasks by fusing the visual representations with LLMs leveraging some visual adapters. In this paper, we first establish that adapters using query-based Transformers, such as Q-former, are a simplified Multi-instance Learning method that does not consider instance heterogeneity/correlation. We then propose a general component termed Multi-instance Visual Prompt Generator (MIVPG) to incorporate enriched visual representations into LLMs by taking advantage of instance correlation between images or patches for the same sample. Quantitative evaluation on three public vision-language (VL) datasets from different scenarios shows that the proposed MIVPG improves Q-former in main VL tasks.

6/6/2024

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuehne, Horst Possegger

Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.

8/9/2024

Visual Prompting in Multimodal Large Language Models: A Survey

Junda Wu, Zhehao Zhang, Yu Xia, Xintong Li, Zhaoyang Xia, Aaron Chang, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ruiyi Zhang, Subrata Mitra, Dimitris N. Metaxas, Lina Yao, Jingbo Shang, Julian McAuley

Multimodal large language models (MLLMs) equip pre-trained large-language models (LLMs) with visual capabilities. While textual prompting in LLMs has been widely studied, visual prompting has emerged for more fine-grained and free-form visual instructions. This paper presents the first comprehensive survey on visual prompting methods in MLLMs, focusing on visual prompting, prompt generation, compositional reasoning, and prompt learning. We categorize existing visual prompts and discuss generative methods for automatic prompt annotations on the images. We also examine visual prompting methods that enable better alignment between visual encoders and backbone LLMs, concerning MLLM's visual grounding, object referring, and compositional reasoning abilities. In addition, we provide a summary of model training and in-context learning methods to improve MLLM's perception and understanding of visual prompts. This paper examines visual prompting methods developed in MLLMs and provides a vision of the future of these methods.

9/25/2024