MANTIS: Interleaved Multi-Image Instruction Tuning

Read original: arXiv:2405.01483 - Published 5/27/2024 by Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen


Overview

Researchers have developed large multimodal models (LMMs) that can effectively solve single-image vision language tasks.
However, the ability of existing multi-image LMMs to solve multi-image visual language tasks is still limited.
Current multi-image LMMs are trained on noisy web data, which is inefficient and ineffective.
This paper presents a new approach to building strong multi-image LMMs through instruction tuning with academic-level resources.

Plain English Explanation

Researchers have created powerful AI models that can handle tasks involving a single image and text, such as describing an image or answering questions about an image. However, the ability of these models to work with multiple images and text together is still limited.

The existing multi-image AI models are usually trained on a huge amount of noisy data from the internet, which is not very efficient or effective. In this paper, the researchers wanted to find a better way to build strong multi-image AI models.

They created a dataset called Mantis-Instruct, which contains 721,000 examples from 14 different multi-image datasets. This dataset covers various skills like comparing images, understanding the timeline of events, and identifying connections between images. The researchers then used this dataset, along with some single-image datasets, to train their new AI model called Mantis.

Even though Mantis was trained using only academic-level resources (36 hours on 16 high-end GPUs), it was able to outperform the current best multi-image AI model by a significant margin on several benchmark tests. Mantis also maintained strong performance on single-image tasks, showing that this approach of instruction tuning is more effective than the current method of extensive pre-training on noisy web data.
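To put the compute figure in perspective, the training budget quoted above can be converted into GPU-hours with simple arithmetic. The snippet below is a back-of-the-envelope sketch: the function name is illustrative, and it assumes all 16 GPUs were busy for the full 36 hours.

```python
def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """Total GPU-hours, assuming every GPU is in use for the whole run."""
    return wall_clock_hours * num_gpus


# Figures quoted in the summary: 36 hours on 16 GPUs.
total = gpu_hours(36, 16)
print(total)  # 576
```

Roughly 576 GPU-hours is a budget a single university lab can plausibly afford, which is the sense in which the summary calls the resources "academic-level".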

Technical Explanation

The researchers constructed a high-quality dataset called Mantis-Instruct containing 721,000 instances from 14 multi-image datasets. This dataset was designed to cover a variety of multi-image skills, such as co-reference, reasoning, comparing, and temporal understanding.

The researchers then combined Mantis-Instruct with several single-image visual-language datasets to train their Mantis model. This allowed Mantis to handle any interleaved image-text inputs.
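What "interleaved image-text input" means in practice can be sketched as follows. This is an illustration only: the `<image>` placeholder string, the `Text` and `Image` types, and the `flatten_interleaved` helper are assumptions made for this example, not the data format or API used by the Mantis authors.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

IMAGE_TOKEN = "<image>"  # hypothetical placeholder, not taken from the paper


@dataclass
class Text:
    content: str


@dataclass
class Image:
    path: str  # hypothetical file reference


Segment = Union[Text, Image]


def flatten_interleaved(segments: List[Segment]) -> Tuple[str, List[str]]:
    """Turn an interleaved sequence into (prompt, image_paths).

    Each image is replaced by IMAGE_TOKEN in the prompt, and its path is
    appended to image_paths in order of appearance, so the n-th placeholder
    corresponds to the n-th image.
    """
    parts: List[str] = []
    image_paths: List[str] = []
    for seg in segments:
        if isinstance(seg, Image):
            parts.append(IMAGE_TOKEN)
            image_paths.append(seg.path)
        else:
            parts.append(seg.content)
    return " ".join(parts), image_paths


example = [
    Text("Here is the first photo:"),
    Image("a.jpg"),
    Text("and the second:"),
    Image("b.jpg"),
    Text("What changed between them?"),
]
prompt, images = flatten_interleaved(example)
print(prompt)
print(images)
```

Under this representation a single-image example is simply a sequence with one `Image` segment, which is one way to see why single-image and multi-image data can be mixed in the same training set.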

Evaluation on five multi-image benchmarks and eight single-image benchmarks showed that Mantis-8B, requiring only academic-level resources, can achieve state-of-the-art performance on all the multi-image benchmarks and beat the existing best multi-image LMM, Idefics2-8B, by an average of 9 absolute points. Importantly, Mantis maintained strong performance on both held-in and held-out evaluation benchmarks.
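The phrase "average absolute points" refers to the mean of per-benchmark score differences, not a relative percentage improvement. The sketch below shows one way to compute it. The benchmark names and scores are placeholders chosen for illustration and are not results from the paper; the function name is likewise hypothetical.

```python
from typing import Dict


def average_absolute_gain(model: Dict[str, float], baseline: Dict[str, float]) -> float:
    """Mean of (model - baseline) over the benchmarks both dicts share."""
    shared = sorted(set(model) & set(baseline))
    if not shared:
        raise ValueError("no shared benchmarks")
    return sum(model[b] - baseline[b] for b in shared) / len(shared)


# Placeholder scores for illustration only; NOT results from the paper.
model_scores = {"bench_a": 54.0, "bench_b": 68.0}
baseline_scores = {"bench_a": 50.0, "bench_b": 62.0}
print(average_absolute_gain(model_scores, baseline_scores))  # 5.0
```

With these placeholder numbers the per-benchmark gains are 4 and 6 points, so the average absolute gain is 5 points, even though the relative improvements (8% and about 9.7%) differ.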

The researchers also found that Mantis performs comparably on single-image benchmarks, on par with other strong models such as CogVLM and Emu2.

Critical Analysis

The researchers argue that their approach of instruction tuning on a curated dataset is much more efficient and effective than the current practice of pre-training on noisy web data. However, they do not discuss the potential limitations of this approach, such as the risk of the model becoming overly specialized to the Mantis-Instruct dataset and struggling to generalize to real-world scenarios.

Additionally, the paper does not provide much insight into the specific architectural choices and training techniques used to develop Mantis. More details on these aspects would allow for a deeper understanding of the model and its capabilities.

While the results are impressive, it would be valuable to see how Mantis performs on a wider range of multi-image tasks and datasets, as the evaluation was limited to a specific set of benchmarks.

Conclusion

This paper presents a novel approach to building strong multi-image large multimodal models (LMMs) through instruction tuning on a curated dataset, Mantis-Instruct. The researchers demonstrate that this method can outperform current state-of-the-art multi-image LMMs, while also maintaining high performance on single-image tasks.

The results suggest that instruction tuning on high-quality datasets may be a more effective strategy for developing multi-image LMMs than the current practice of pre-training on noisy web data. This could have significant implications for the field of multimodal AI, potentially leading to more reliable and versatile models for real-world applications.



Related Papers


MANTIS: Interleaved Multi-Image Instruction Tuning

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen

Large multimodal models (LMMs) have shown great results in single-image vision language tasks. However, their ability to solve multi-image visual language tasks has yet to be improved. Existing LMMs such as OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text samples from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct, containing 721K multi-image instruction data, to train a family of models called Mantis. The instruction tuning empowers Mantis with different multi-image skills such as co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on five multi-image benchmarks and seven single-image benchmarks. Mantis-SigLIP can achieve SoTA results on all the multi-image benchmarks and beat the strongest multi-image baseline, Idefics2-8B, by an average of 11 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image data, which is 200x larger than Mantis-Instruct. We observe that Mantis performs equivalently well on the held-in and held-out benchmarks, which shows its generalization ability. Notably, we found that Mantis can even match the performance of GPT-4V on multi-image benchmarks. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis also maintains strong single-image performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training; instead, they can be gained through low-cost instruction tuning. Our work provides new perspectives on how to improve LMMs' multi-image abilities.

5/27/2024

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, Yu Qiao, Jifeng Dai

Vision-language supervised fine-tuning is effective in enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets have the following limitations: (1) Instruction annotation quality: despite existing VLLMs exhibiting strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may impact the model's ability to generate diverse outputs that are closer to real-world scenarios. To address these challenges, we construct a high-quality, diverse visual instruction tuning dataset, MMInstruct, which consists of 973K instructions from 24 domains. There are four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experimental validation and ablation experiments, we demonstrate that MMInstruct can significantly improve the performance of VLLMs, e.g., the model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data shall be available at https://github.com/yuecao0119/MMInstruct.

8/9/2024

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan

We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

7/29/2024

Towards Robust Instruction Tuning on Multimodal Large Language Models

Wei Han, Hui Chen, Soujanya Poria

Fine-tuning large language models (LLMs) on multi-task instruction-following data has been proven to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent works on high-quality instruction-following data generation and selection require substantial human labor to conceive model-understandable instructions for the given tasks and to carefully filter the LLM-generated data. In this work, we introduce an automatic instruction augmentation method named INSTRAUG for multimodal tasks. It starts from a handful of basic and straightforward meta instructions but can expand an instruction-following dataset by 30 times. Results on two popular multimodal instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP, show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks, which is even equivalent to the benefits of scaling up training data multiple times.

6/17/2024