CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

2406.10462

Published 6/18/2024 by Wei Chen, Lin Li, Yongqi Yang, Bin Wen, Fan Yang, Tingting Gao, Yu Wu, Long Chen

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Abstract

Interleaved image-text generation has emerged as a crucial multimodal task, aiming at creating sequences of interleaved visual and textual content given a query. Despite notable advancements in recent multimodal large language models (MLLMs), generating integrated image-text sequences that exhibit narrative coherence and entity and style consistency remains challenging due to poor training data quality. To address this gap, we introduce CoMM, a high-quality Coherent interleaved image-text MultiModal dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content. Initially, CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling, establishing a foundation for coherent and consistent content. To further refine the data quality, we devise a multi-perspective filter strategy that leverages advanced pre-trained models to ensure the development of sentences, consistency of inserted images, and semantic alignment between them. Various quality evaluation metrics are designed to prove the high quality of the filtered dataset. Meanwhile, extensive few-shot experiments on various downstream tasks demonstrate CoMM's effectiveness in significantly enhancing the in-context learning capabilities of MLLMs. Moreover, we propose four new tasks to evaluate MLLMs' interleaved generation abilities, supported by a comprehensive evaluation framework. We believe CoMM opens a new avenue for advanced MLLMs with superior multimodal in-context learning and understanding ability.

Create account to get full access

Overview

This paper introduces CoMM, a new dataset of interleaved image-text pairs designed for multimodal understanding and generation tasks.
CoMM contains over 1 million image-text pairs that are coherently connected, allowing models to learn how visual and textual information work together.
The dataset is designed to support a wide range of multimodal applications, from image captioning to visual question answering.

Plain English Explanation

The researchers who created this paper have developed a new dataset called CoMM that contains over 1 million pairs of images and text. These image-text pairs are "interleaved," meaning the text and visuals are closely connected and complement each other. This allows machine learning models to better understand how visual and textual information work together, which is important for building systems that can truly comprehend and generate multimodal content.

Unlike many existing datasets that have images and text that are more loosely related, the information in CoMM is coherent and tightly integrated. This makes it a valuable resource for training models on tasks like image captioning or visual question answering, where the model needs to understand the relationship between visual and textual elements.

The goal is for CoMM to support the development of more advanced multimodal AI systems that can seamlessly combine images, text, and other modalities to better comprehend and generate content. This could have applications in areas like multimodal search, multimodal dialogue, and even creative tasks like multimodal story generation.

Technical Explanation

The key innovation of the CoMM dataset is its "coherent interleaving" of images and text. Unlike many existing multimodal datasets where the visual and textual elements are more loosely related, the information in CoMM is carefully curated to ensure a tight coupling between the two modalities.

The dataset was constructed by gathering image-text pairs from a variety of sources and then using advanced natural language processing and computer vision techniques to identify pairs that were highly coherent and complementary. This resulted in over 1 million pairs of images and corresponding text that work together in a cohesive way.

To evaluate the quality and usefulness of the dataset, the researchers conducted a series of experiments on multimodal tasks like image captioning, visual question answering, and multimodal generation. The results showed that models trained on CoMM were able to learn more nuanced and effective multimodal representations compared to models trained on less coherent datasets.

Critical Analysis

One potential limitation of the CoMM dataset is the scope of the content it covers. While the researchers made efforts to include a diverse range of topics and genres, the dataset may still be biased towards certain domains or styles of communication. Additionally, the curation process, while meticulous, could potentially introduce its own biases.

It would be valuable to see the dataset evaluated on a wider range of multimodal tasks and applications to fully understand its strengths and weaknesses. Additionally, comparing the performance of models trained on CoMM versus other prominent multimodal datasets could provide more insight into the unique advantages of this new resource.

Finally, the researchers acknowledge that further work is needed to scale up the dataset and make it accessible to a broader community of researchers and developers. Ongoing maintenance and expansion of the dataset will be crucial to ensuring its long-term impact on the field of multimodal AI.

Conclusion

Overall, the CoMM dataset represents an important step forward in the development of coherent, interleaved multimodal datasets. By creating a resource where images and text are tightly coupled, the researchers have enabled the training of more sophisticated multimodal models that can better understand and generate content involving multiple modalities.

The potential applications of this technology are wide-ranging, from improved multimodal search to more engaging multimodal dialogue systems. As the field of multimodal AI continues to advance, datasets like CoMM will play a crucial role in powering the next generation of multimodal understanding and generation capabilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🤖

OmniCorpus: An Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang, Min Dou, Changyao Tian, Xizhou Zhu, Lewei Lu, Yushi Chen, Junjun He, Zhongying Tu, Tong Lu, Yali Wang, Limin Wang, Dahua Lin, Yu Qiao, Botian Shi, Conghui He, Jifeng Dai

Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.

6/17/2024

cs.CV cs.AI

VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Chenyu Zhou, Mengdan Zhang, Peixian Chen, Chaoyou Fu, Yunhang Shen, Xiawu Zheng, Xing Sun, Rongrong Ji

The swift progress of Multi-modal Large Models (MLLMs) has showcased their impressive ability to tackle tasks blending vision and language. Yet, most current models and benchmarks cater to scenarios with a narrow scope of visual and textual contexts. These models often fall short when faced with complex comprehension tasks, which involve navigating through a plethora of irrelevant and potentially misleading information in both text and image forms. To bridge this gap, we introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions and to follow intricate instructions to pinpoint the relevant image. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devised a subtask, Image-Text Association (ITA), to refine image-text correlation skills. Our evaluation of four leading closed-source models, as well as various open-source models using VEGA, underscores the rigorous nature of IITC. Even the most advanced models, such as Gemini-1.5-pro and GPT4V, only achieved modest success. By employing a multi-task, multi-scale post-training strategy, we have set a robust baseline for MLLMs on the IITC task, attaining an $85.8%$ accuracy rate in image association and a $0.508$ Rouge score. These results validate the effectiveness of our dataset in improving MLLMs capabilities for nuanced image-text comprehension.

6/17/2024

cs.CV cs.AI cs.CL

💬

VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools

Ji Qi, Kaixuan Ji, Jifan Yu, Duokang Wang, Bin Xu, Lei Hou, Juanzi Li

Building models that comprehends videos and responds specific user instructions is a practical and challenging topic, as it requires mastery of both vision understanding and knowledge reasoning. Compared to language and image modalities, training efficiency remains a serious problem as existing studies train models on massive sparse videos paired with brief descriptions. In this paper, we introduce textbf{VidCoM}, a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools. Specifically, we reveal that the key to responding to specific instructions is focusing on relevant video events, and utilize two visual tools, structured scene graph generation and descriptive image caption generation, to gather and represent the event information. Thus, a LLM enriched with world knowledge is adopted as the reasoning agent to achieve the responses by performing multiple reasoning steps on specific video events. To address the difficulty of LLMs identifying video events, we further propose an Instruction-oriented Video Events Recognition (InsOVER) algorithm. This algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events, thereby enabling LLMs to interact effectively with extended videos. Extensive experiments on two typical video comprehension tasks show that the proposed tuning-free framework outperforms the pre-trained models including Flamingo-80B, to achieve the state-of-the-art performance. Our source code and system will be publicly available.

4/30/2024

cs.CV cs.CL

TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding

Bozhi Luan, Hao Feng, Hong Chen, Yonghui Wang, Wengang Zhou, Houqiang Li

The advent of Large Multimodal Models (LMMs) has sparked a surge in research aimed at harnessing their remarkable reasoning abilities. However, for understanding text-rich images, challenges persist in fully leveraging the potential of LMMs, and existing methods struggle with effectively processing high-resolution images. In this work, we propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding. TextCoT utilizes the captioning ability of LMMs to grasp the global context of the image and the grounding capability to examine local textual regions. This allows for the extraction of both global and local visual information, facilitating more accurate question-answering. Technically, TextCoT consists of three stages, including image overview, coarse localization, and fine-grained observation. The image overview stage provides a comprehensive understanding of the global scene information, and the coarse localization stage approximates the image area containing the answer based on the question asked. Then, integrating the obtained global image descriptions, the final stage further examines specific regions to provide accurate answers. Our method is free of extra training, offering immediate plug-and-play functionality. Extensive experiments are conducted on a series of text-rich image question-answering benchmark datasets based on several advanced LMMs, and the results demonstrate the effectiveness and strong generalization ability of our method. Code is available at https://github.com/bzluan/TextCoT.

4/16/2024

cs.CV