Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

2406.09367

Published 6/14/2024 by Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Weipeng Chen, Jing Liu

cs.CV

Abstract

Video understanding is a crucial next step for multimodal large language models (MLLMs). To probe specific aspects of video understanding ability, existing video benchmarks typically require careful video selection based on the target capability, along with laborious annotation of query-response pairs to match the specific video content. This process is both challenging and resource-intensive. In this paper, we propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation. VideoNIAH decouples test video content from their query-responses by inserting unrelated image/text 'needles' into original videos. It generates annotations solely from these needles, ensuring diversity in video sources and a variety of query-responses. Additionally, by inserting multiple needles, VideoNIAH rigorously evaluates the temporal understanding capabilities of models. We utilized VideoNIAH to compile a video benchmark VNBench, including tasks such as retrieval, ordering, and counting. VNBench can efficiently evaluate the fine-grained understanding ability and spatio-temporal modeling ability of a video model, while also supporting the long-context evaluation. Additionally, we evaluated recent video-centric multimodal large language models (MLLMs), both open-source and proprietary, providing a comprehensive analysis. We found that although proprietary models have significant advantages over open-source models, all existing video models still perform poorly on long-distance dependency tasks. VideoNIAH is a simple yet highly scalable benchmark construction framework, and we believe it will inspire future video benchmark works. The code and data are available at https://github.com/joez17/VideoNIAH.

Overview

This paper proposes VideoNIAH, a framework for constructing benchmarks that evaluate the video understanding capabilities of multimodal large language models (MLLMs).
VideoNIAH generates synthetic videos by inserting unrelated "needles" (images or text) into original videos, and then creates annotations based solely on these inserted needles.
This approach allows for greater diversity in video sources and query-response pairs, while also rigorously testing the models' temporal understanding abilities by including multiple needles.
The authors used VideoNIAH to create a benchmark called VNBench, which includes tasks like retrieval, ordering, and counting.
They evaluated recent video-centric multimodal LLMs, both open-source and proprietary, on VNBench, finding that while proprietary models outperform open-source ones, all existing video models still struggle with long-distance dependency tasks.

Plain English Explanation

Understanding video content is an important next step for multimodal large language models (MLLMs), which process text, images, and other modalities. However, creating benchmarks to test this capability is challenging. Existing video benchmarks typically require carefully selecting videos for the target capability and then manually annotating query-response pairs that match each video's content.

The researchers in this paper propose a new approach called VideoNIAH that simplifies the benchmark creation process. Instead of hand-picking videos and annotating their content, VideoNIAH generates synthetic test videos by inserting unrelated "needles" (images or text) into existing videos. It then creates annotations based solely on these inserted needles, so almost any video can serve as the haystack and the query-response pairs can be varied freely.

By including multiple needles in each video, VideoNIAH also tests the models' ability to understand the temporal relationships between different parts of the video - a key aspect of video understanding. The authors used VideoNIAH to create a benchmark called VNBench, which includes tasks like retrieving an inserted needle, ordering several needles, and counting needles.

When the researchers evaluated recent video-centric multimodal LLMs on VNBench, they found that proprietary models outperformed open-source ones. However, all the models still struggled with tasks that require understanding long-distance dependencies in the video content.

Overall, VideoNIAH is a simple yet scalable framework for creating video benchmarks that can efficiently evaluate the fine-grained understanding and spatio-temporal modeling abilities of video models. The researchers hope it will inspire future work in this area, leading to more comprehensive multi-modal video understanding benchmarks and models that can handle extremely long videos.

Technical Explanation

The key innovation in this paper is the VideoNIAH benchmark construction framework, which uses synthetic video generation to decouple test video content from their query-response annotations. Instead of carefully selecting and annotating real videos, VideoNIAH inserts unrelated "needles" (images or text) into original videos, and then generates annotations solely based on these inserted needles.
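The insertion step described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the `Needle` record, the `insert_needles` function, and the representation of a video as a plain list of frames are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Needle:
    """An inserted item unrelated to the haystack video (hypothetical schema)."""
    label: str     # what the needle shows, e.g. the word on a text overlay
    position: int  # index of the needle in the synthetic frame sequence


def insert_needles(frames, labels, seed=0):
    """Insert one needle frame per label at random positions.

    `frames` stands in for a decoded haystack video; each frame is treated
    as an opaque object. Returns the synthetic sequence and the needle
    records, which are the only source of annotations.
    """
    rng = random.Random(seed)
    # Pick distinct insertion slots: slot i means "just before frame i".
    slots = sorted(rng.sample(range(len(frames) + 1), k=len(labels)))
    synthetic, needles = [], []
    next_slot = 0
    for i in range(len(frames) + 1):
        if next_slot < len(slots) and slots[next_slot] == i:
            label = labels[next_slot]
            needles.append(Needle(label, len(synthetic)))
            synthetic.append(("needle", label))
            next_slot += 1
        if i < len(frames):
            synthetic.append(("haystack", frames[i]))
    return synthetic, needles
```

Because the annotation is derived from the `Needle` records alone, nothing in this sketch needs to know what the haystack frames contain.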

This approach offers several advantages:

Diversity: By using a wide range of video sources and generating annotations from the needles, VideoNIAH can create a diverse set of query-response pairs, unlike traditional benchmarks that rely on a limited number of carefully curated videos.
Scalability: The synthetic video generation and automated annotation process make VideoNIAH highly scalable, allowing for the creation of large-scale benchmarks.
Temporal Understanding: By inserting multiple needles into each video, VideoNIAH tests the models' ability to understand the temporal relationships between different parts of the video, a crucial aspect of video understanding.
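One way to see why decoupling helps with diversity and scale: since answers depend only on the needles, any needle set can be paired with any haystack video. The sketch below assumes a hypothetical `build_samples` helper and a made-up sample schema; the question wording is illustrative and is not taken from VNBench.

```python
from itertools import product


def build_samples(haystack_ids, needle_sets):
    """Pair every haystack video with every needle set.

    The answer is computed from the needle set only, so it stays the same
    no matter which video the needles are embedded in.
    """
    samples = []
    for video_id, needles in product(haystack_ids, needle_sets):
        samples.append({
            "video": video_id,
            "needles": list(needles),
            "question": "How many needles were inserted into the video?",
            "answer": len(needles),
        })
    return samples
```

With this pairing, the number of test samples grows as the product of the number of source videos and the number of needle sets, without any new manual annotation.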

Using the VideoNIAH framework, the authors compiled a video benchmark called VNBench, which includes tasks such as retrieval, ordering, and counting. They then evaluated several recent video-centric multimodal LLMs, both open-source and proprietary, on VNBench.
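As a rough illustration of how the three task types could be answered from needle records alone, consider the sketch below. The `make_queries` function, the `(label, position)` tuple format, and the question phrasing are assumptions for this example; the actual VNBench prompts and answer formats may differ.

```python
from collections import Counter


def make_queries(needles, count_target):
    """Build retrieval, ordering, and counting queries from needle records.

    `needles` is a list of (label, position) tuples, where position is the
    frame index at which the needle was inserted.
    """
    ordered = sorted(needles, key=lambda n: n[1])
    labels = [label for label, _ in ordered]
    # Distinct labels in order of first appearance (dicts keep insertion order).
    first_seen = list(dict.fromkeys(labels))
    return {
        "retrieval": {
            "question": "Which inserted items appear in the video?",
            "answer": sorted(set(labels)),
        },
        "ordering": {
            "question": "In what order do the inserted items first appear?",
            "answer": first_seen,
        },
        "counting": {
            "question": f"How many times does '{count_target}' appear?",
            "answer": Counter(labels)[count_target],
        },
    }
```

Ordering and counting both require the model to combine evidence from several moments in the video, which is why multi-needle tasks probe temporal understanding more strongly than single-needle retrieval.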

The results showed that proprietary models hold a significant advantage over open-source models. The abstract does not attribute this gap to a specific cause; differences in training data and training methods are one plausible explanation. However, all the models still performed poorly on tasks that require understanding long-distance dependencies in the video content, indicating the need for further research and development in this area.
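A comparison like this comes down to grouping per-sample correctness by some attribute, such as task type or model. The sketch below assumes a hypothetical result record with `prediction` and `answer` fields and uses exact-match scoring; the paper's actual scoring protocol is not described in this summary.

```python
from collections import defaultdict


def accuracy_by_group(results, key):
    """Compute exact-match accuracy for each value of `key`.

    `results` is a list of dicts that each contain `prediction`, `answer`,
    and the grouping field named by `key` (for example "task" or "model").
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        group = r[key]
        totals[group] += 1
        correct[group] += int(r["prediction"] == r["answer"])
    return {group: correct[group] / totals[group] for group in totals}
```

Grouping the same results by a field such as video length or the distance between needles would show whether accuracy drops as dependencies get longer.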

Critical Analysis

The VideoNIAH framework and VNBench benchmark proposed in this paper address an important challenge in video understanding for multimodal LLMs. By decoupling video content from query-response annotations, the authors have created a scalable and efficient way to evaluate the fine-grained understanding and spatio-temporal modeling abilities of video models.

One potential limitation of the approach is that the synthetic videos, while diverse, may not fully capture the complexity and nuance of real-world video content. The authors acknowledge this and suggest that VideoNIAH should be used in conjunction with existing video benchmarks to provide a more comprehensive evaluation of video understanding capabilities.

Additionally, the authors' finding that existing video models struggle with long-distance dependency tasks highlights the need for further research and development in this area. Incorporating long-video understanding and question-answering capabilities into multimodal LLMs could be a fruitful area for future work.

Overall, the VideoNIAH framework and VNBench benchmark represent a significant contribution to the field of video understanding, and the authors' comprehensive evaluation of existing video models provides valuable insights for the development of more robust and capable multimodal LLMs.

Conclusion

This paper proposes a novel benchmark construction framework called VideoNIAH that simplifies the creation of video benchmarks for evaluating the video understanding capabilities of multimodal large language models (MLLMs). By generating synthetic videos with inserted "needles" and automating the annotation process, VideoNIAH avoids the manual video selection and annotation that traditional video benchmarks require, and allows for the creation of large-scale, diverse, and temporally aware test sets.

The authors used VideoNIAH to create the VNBench benchmark, which they used to evaluate several recent video-centric multimodal LLMs. While their results showed that proprietary models outperform open-source ones, all the evaluated models still struggle with tasks that require understanding long-distance dependencies in video content.

VideoNIAH is a significant step forward in the development of video understanding benchmarks, and the authors' findings highlight the need for further research and development in this area. As multimodal LLMs continue to advance, tools like VideoNIAH will be crucial for driving progress and ensuring that these models can truly understand and reason about the rich and complex world of video content.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Needle In A Multimodal Haystack

Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang

With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplored. In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning. In each task, the model is required to answer the questions according to different key information scattered throughout the given multimodal document. Evaluating the leading MLLMs on MM-NIAH, we observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation. We hope this work can provide a platform for further research on long multimodal document comprehension and contribute to the advancement of MLLMs. Code and benchmark are released at https://github.com/OpenGVLab/MM-NIAH.

6/12/2024

cs.CV cs.AI

Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, Hao Wang

Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address these gaps, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems in negative samples, i.e., when needles are not in the haystacks. Our comprehensive long-context evaluation of MLLMs also sheds light on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.

6/18/2024

cs.LG cs.AI cs.CL cs.CV

Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu

Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.

7/2/2024

cs.CV

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen

The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evaluation code of MMBench-Video will be integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.

6/21/2024

cs.CV cs.MM