VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools

2310.10586

Published 4/30/2024 by Ji Qi, Kaixuan Ji, Jifan Yu, Duokang Wang, Bin Xu, Lei Hou, Juanzi Li


Abstract

Building models that comprehend videos and respond to specific user instructions is a practical and challenging topic, as it requires mastery of both vision understanding and knowledge reasoning. Compared to language and image modalities, training efficiency remains a serious problem as existing studies train models on massive sparse videos paired with brief descriptions. In this paper, we introduce VidCoM, a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools. Specifically, we reveal that the key to responding to specific instructions is focusing on relevant video events, and utilize two visual tools, structured scene graph generation and descriptive image caption generation, to gather and represent the event information. Thus, an LLM enriched with world knowledge is adopted as the reasoning agent to produce the responses by performing multiple reasoning steps on specific video events. To address the difficulty of LLMs identifying video events, we further propose an Instruction-oriented Video Events Recognition (InsOVER) algorithm. This algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events, thereby enabling LLMs to interact effectively with extended videos. Extensive experiments on two typical video comprehension tasks show that the proposed tuning-free framework outperforms pre-trained models, including Flamingo-80B, to achieve state-of-the-art performance. Our source code and system will be publicly available.


Overview

This paper introduces VidCoM, a framework for video comprehension that leverages Large Language Models (LLMs) and lightweight visual tools to respond to specific user instructions.
The key insight is that effectively responding to instructions requires focusing on relevant video events, which can be captured using structured scene graph generation and descriptive image captioning.
The paper also proposes an Instruction-oriented Video Events Recognition (InsOVER) algorithm to help LLMs identify the relevant video events and interact more effectively with extended videos.
Experiments show that VidCoM outperforms pre-trained models like Flamingo-80B on video comprehension tasks, achieving state-of-the-art performance without the need for fine-tuning.

Plain English Explanation

The paper discusses a new framework called VidCoM that aims to help machines better understand and respond to instructions related to videos. Compared to understanding language or images, video comprehension is particularly challenging because the information is spread out over time and can be quite complex.

The key insight is that to respond effectively to instructions, the system needs to focus on the relevant events happening in the video. The researchers use two visual tools to gather and represent this event information: structured scene graph generation and descriptive image caption generation. This allows a large language model, enriched with world knowledge, to reason about the video and generate appropriate responses by performing multiple steps of reasoning on the specific events.

To help the language model identify the relevant video events, the researchers also propose a new algorithm called InsOVER. This algorithm matches the linguistic instructions to the corresponding video events in an efficient way, enabling the language model to interact more effectively with extended videos.

Through experiments on two typical video comprehension tasks, the researchers show that their VidCoM framework outperforms pre-trained models like Flamingo-80B, achieving state-of-the-art performance without any fine-tuning.

Technical Explanation

The VidCoM framework leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools. The key insight is that the ability to respond to specific instructions requires focusing on relevant video events. To capture this event information, the researchers use two visual tools:

Structured scene graph generation: This tool generates a structured representation of the objects, relationships, and actions present in the video frames.
Descriptive image caption generation: This tool generates natural language descriptions of the key elements in the video frames.
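
The output of the two tools above can be pictured as a small text-friendly data structure. The following is a minimal sketch, not the paper's actual schema: the class names FrameDescription and VideoEvent, the triple layout, and the serialized text format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema for illustration only; the paper does not specify one.
Triple = Tuple[str, str, str]  # (subject, relation, object) from a scene graph


@dataclass
class FrameDescription:
    timestamp_s: float                                     # position of the frame in the video
    caption: str                                           # output of the captioning tool
    triples: List[Triple] = field(default_factory=list)    # output of the scene graph tool

    def to_text(self) -> str:
        """Serialize one frame as plain text that an LLM can read."""
        graph = "; ".join(f"{s} {r} {o}" for s, r, o in self.triples)
        return f"[{self.timestamp_s:.1f}s] caption: {self.caption} | graph: {graph}"


@dataclass
class VideoEvent:
    frames: List[FrameDescription]

    def to_text(self) -> str:
        """Serialize the whole event, one frame per line."""
        return "\n".join(f.to_text() for f in self.frames)


event = VideoEvent(frames=[
    FrameDescription(1.0, "a man holds a cup", [("man", "holds", "cup")]),
    FrameDescription(2.0, "the man drinks from the cup", [("man", "drinks from", "cup")]),
])
print(event.to_text())
```

Keeping both views side by side reflects the summary's description: the scene graph contributes explicit structure, while the caption contributes free-form description of the same frame.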

The LLM, enriched with world knowledge, is then adopted as the reasoning agent. It performs multiple reasoning steps on the specific video events represented by the scene graphs and captions to generate the appropriate responses.
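
A multi-step reasoning loop of this kind might look roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the prompt wording, the "ANSWER:" stopping convention, the function names, and the fake_llm stand-in (used here instead of a real model call) are all hypothetical.

```python
from typing import Callable, List


def build_prompt(instruction: str, event_text: str, notes: List[str]) -> str:
    """Assemble a prompt from the event description and intermediate notes."""
    notes_text = "\n".join(f"- {n}" for n in notes) or "(none)"
    return (
        f"Video event:\n{event_text}\n\n"
        f"Instruction: {instruction}\n\n"
        f"Notes so far:\n{notes_text}\n\n"
        "Reply with 'ANSWER: <text>' when done, otherwise add one observation."
    )


def reason_over_event(
    instruction: str,
    event_text: str,
    llm: Callable[[str], str],
    max_steps: int = 3,
) -> str:
    """Query the LLM repeatedly, accumulating observations until it answers."""
    notes: List[str] = []
    for _ in range(max_steps):
        reply = llm(build_prompt(instruction, event_text, notes))
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        notes.append(reply)
    # No explicit answer within the step budget: fall back to the last note.
    return notes[-1] if notes else ""


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM: observes once, then answers."""
    if "Notes so far:\n(none)" in prompt:
        return "The man is holding an object in the first frame."
    return "ANSWER: a cup"


answer = reason_over_event(
    "What is the man holding?",
    "[1.0s] caption: a man holds a cup | graph: man holds cup",
    fake_llm,
)
print(answer)
```

Passing the model in as a plain callable keeps the loop independent of any particular LLM provider, which matches the tuning-free spirit of the framework.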

To address the challenge of LLMs identifying relevant video events, the researchers propose the Instruction-oriented Video Events Recognition (InsOVER) algorithm. This algorithm efficiently matches the linguistic instructions to the corresponding video events by decomposing both the instructions and video events and performing a Hungarian matching between them.
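
The matching step can be illustrated with a small, self-contained Hungarian assignment. This is a sketch only: the word-overlap (Jaccard) cost, the example sub-instructions, and the example event captions are assumptions chosen for illustration, and the summary does not say how InsOVER actually scores a pair.

```python
from typing import List


def jaccard_cost(a: str, b: str) -> float:
    """Hypothetical cost: 1 minus the word-overlap ratio of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return 1.0 - (len(ta & tb) / len(union) if union else 0.0)


def hungarian(cost: List[List[float]]) -> List[int]:
    """Minimum-cost assignment of each row to a distinct column.

    Standard O(n^2 * m) Hungarian algorithm with potentials; requires
    len(cost) <= len(cost[0]). Returns the column chosen for each row.
    """
    n, m = len(cost), len(cost[0])
    assert n <= m, "need at least as many columns as rows"
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: row (1-indexed) matched to column j, 0 if free
    way = [0] * (m + 1)   # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # walk back along the path, flipping matches
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment


sub_instructions = ["man picks up cup", "man opens door"]
event_captions = [
    "a man opens the door",
    "a dog runs outside",
    "the man picks up a cup",
]
cost = [[jaccard_cost(s, e) for e in event_captions] for s in sub_instructions]
match = hungarian(cost)
for s, j in zip(sub_instructions, match):
    print(f"{s!r} -> event {j}: {event_captions[j]!r}")
```

In practice the same assignment problem is usually solved with a library routine such as scipy.optimize.linear_sum_assignment; the hand-written version here only avoids the external dependency.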

The researchers conduct extensive experiments on two typical video comprehension tasks, demonstrating that their VidCoM framework outperforms pre-trained models, including Flamingo-80B, to achieve state-of-the-art performance without the need for fine-tuning.

Critical Analysis

The researchers acknowledge several limitations and areas for further research in their paper:

The current framework relies on pre-trained visual tools for scene graph generation and image captioning, which may not always be available. Exploring end-to-end training of these visual components could further improve the framework's flexibility and performance.
The experiments focus on relatively short videos, and the researchers note that extending the framework to handle longer, more complex videos remains an open challenge.
The paper does not provide a detailed analysis of the types of instructions or video scenarios that the framework performs best or worst on. Investigating these edge cases could help identify areas for improvement.

Additionally, one could question whether the reliance on visual tools like scene graph generation and image captioning, which may introduce their own biases and errors, is the optimal approach. Exploring alternative ways of representing and reasoning about video events could be a fruitful direction for future research.

Overall, the VidCoM framework represents an interesting and promising step towards improving video comprehension capabilities, but there is still room for further refinement and exploration to make the system more robust and generalizable.

Conclusion

This paper introduces VidCoM, a framework that leverages Large Language Models (LLMs) and lightweight visual tools to effectively respond to specific user instructions related to videos. The key insight is that the ability to respond to instructions requires focusing on the relevant video events, which can be captured using structured scene graph generation and descriptive image captioning.

The proposed InsOVER algorithm helps LLMs identify the relevant video events, enabling them to interact more effectively with extended videos. Extensive experiments show that the VidCoM framework outperforms pre-trained models, including Flamingo-80B, on video comprehension tasks, achieving state-of-the-art performance without the need for fine-tuning.

While the framework has some limitations, such as its reliance on pre-trained visual tools and its focus on shorter videos, this research represents an important step forward in improving video comprehension capabilities. Further refinements and explorations of alternative approaches could lead to even more robust and versatile systems for understanding and responding to complex video content.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LongVLM: Efficient Long Video Understanding via Large Language Models

Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang

Empowered by Large Language Models (LLMs), recent advancements in VideoLLMs have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, making computational and memory costs affordable. Despite successfully providing an overall comprehension of video content, existing VideoLLMs still face challenges in achieving detailed understanding in videos due to overlooking local information in long-term videos. To tackle this challenge, we introduce LongVLM, a straightforward yet powerful VideoLLM for long video understanding, building upon the observation that long videos often consist of sequential key events, complex actions, and camera movements. Our approach proposes to decompose long videos into multiple short-term segments and encode local features for each local segment via a hierarchical token merging module. These features are concatenated in temporal order to maintain the storyline across sequential short-term segments. Additionally, we propose to integrate global semantics into each local feature to enhance context understanding. In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate the superior capabilities of our model over the previous state-of-the-art methods. Qualitative examples demonstrate that our model produces more precise responses for long video understanding. Code will be available at https://github.com/ziplab/LongVLM.

4/11/2024

cs.CV

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang

Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss. However, the LLMs' understanding paradigm of vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves the computational efficiency during the inference stage. Specifically, our method achieves minimal performance loss with a compression ratio of 576×, resulting in up to 94.8% fewer FLOPs and 69.6% acceleration in inference time. Furthermore, through continuous training using time-series compressed token sequences of video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of VLMs' contextual window, enabling more scalable multi-modal applications. The project page, along with the associated code, can be accessed via https://yxxxb.github.io/VoCo-LLaMA-page/.

6/19/2024

cs.CV

Towards Event-oriented Long Video Understanding

Yifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen

With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging (VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33, significantly outperforming the best open-source model by 41.42%. Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on the Event-Bench. All code, data, and models are publicly available at https://github.com/RUCAIBox/Event-Bench.

6/21/2024

cs.CV cs.CL cs.MM

VideoLLM-online: Online Video Large Language Model for Streaming Video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou

Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the model responses in real-world video streams. With our LIVE framework, we built the VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at https://showlab.github.io/videollm-online.

6/18/2024

cs.CV