InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Read original: arXiv:2403.15377 - Published 8/15/2024 by Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng and 10 others


Overview

The paper presents InternVideo2, a scalable video foundation model for multimodal video understanding.
It demonstrates how large-scale video-language pretraining can enable strong performance on a diverse set of video tasks.
The model is trained on 260 million video-text pairs, and the paper's abstract reports scaling the video encoder to 6B parameters.

Plain English Explanation

The paper introduces InternVideo2, an AI system built to understand and process video content. It was trained on 260 million video-text pairs, a scale that this summary describes as the largest for a video foundation model to date.

By training on this much paired data, InternVideo2 learns how what appears in a video relates to the language used to describe it. This allows the model to perform well on a wide range of video-related tasks, from captioning and question answering to action recognition and video retrieval.

The key idea behind InternVideo2 is scaling up: using far more data and computational power than previous models. According to the paper's abstract, the video encoder grows to 6B parameters, and training proceeds progressively, combining masked video modeling, crossmodal contrastive learning, and next token prediction.

Overall, the paper demonstrates how advancements in video foundation models can lead to significant breakthroughs in our ability to understand and interact with video content in a more natural, intuitive way. This has exciting implications for a wide range of applications, from assistive technology to creative and entertainment media.

Technical Explanation

The paper presents InternVideo2, a large-scale video foundation model trained on a dataset of 260 million video-text pairs. This summary characterizes it as the largest video foundation model to date in both dataset size and model capacity; the abstract reports a video encoder scaled to 6B parameters.

The architecture pairs a transformer-based video encoder with a text encoder. The two are trained jointly to learn aligned representations of the video and language modalities: the video encoder processes the visual input, while the text encoder handles the accompanying text, such as captions or descriptions.
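
One common way to train a video encoder and a text encoder jointly is a symmetric contrastive (InfoNCE) objective, which corresponds to what the paper's abstract calls crossmodal contrastive learning. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are hypothetical, and a real model would produce the embeddings with large neural encoders rather than hand-written lists.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    video_embs[i] and text_embs[i] are assumed to describe the same clip.
    The temperature value is an illustrative assumption.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # sims[i][j] = scaled cosine similarity between video i and text j.
    sims = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
            for vi in v]

    # Video-to-text: each video should score highest on its own caption.
    v2t = -sum(log_softmax(sims[i])[i] for i in range(n)) / n

    # Text-to-video: same idea on the transposed similarity matrix.
    sims_t = [list(col) for col in zip(*sims)]
    t2v = -sum(log_softmax(sims_t[j])[j] for j in range(n)) / n

    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    mismatched = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs loss:   ", contrastive_loss(videos, matched))
    print("mismatched pairs loss:", contrastive_loss(videos, mismatched))
```

The loss is small when each video embedding is closest to its own caption and large when captions are swapped, which is the pressure that pulls the two encoders into a shared embedding space.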

Through this multimodal pretraining, the model learns to effectively capture and integrate the relationships between visual and linguistic information, enabling it to perform well on a diverse range of video understanding tasks. These include video captioning, video question answering, action recognition, and video retrieval.
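
Once video and text live in a shared embedding space, a task such as text-to-video retrieval reduces to ranking videos by similarity to a query embedding. The sketch below illustrates that idea only; the function names, video identifiers, and three-dimensional embeddings are hypothetical stand-ins for what a pretrained encoder would output, and it does not use the InternVideo2 codebase.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_videos(query_emb, video_embs, top_k=3):
    """Return the top_k (video_id, score) pairs, best match first.

    video_embs maps a video id to its embedding; in practice these would be
    precomputed by a video encoder (hypothetical toy values here).
    """
    scored = [(vid, cosine_similarity(query_emb, emb))
              for vid, emb in video_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy index: ids and vectors are made up for illustration.
    index = {
        "cooking_clip": [0.9, 0.1, 0.0],
        "soccer_clip": [0.0, 1.0, 0.1],
        "piano_clip": [0.1, 0.0, 1.0],
    }
    # Pretend this came from the text encoder for a cooking-related query.
    query = [1.0, 0.2, 0.0]
    for video_id, score in retrieve_videos(query, index, top_k=2):
        print(video_id, round(score, 3))
```

A brute-force scan like this is fine for a handful of videos; a large collection would typically use an approximate nearest-neighbor index instead.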

The paper reports extensive experiments in which InternVideo2 outperforms previous state-of-the-art models; the abstract cites results on over 60 video and audio tasks. The authors also analyze the model's scaling behavior, showing that performance continues to improve as dataset and model size increase.

Critical Analysis

The paper acknowledges several limitations and areas for future research. For instance, the model is primarily focused on understanding existing video content and may not be as effective at generating or manipulating video data.

Additionally, the dataset used for pretraining, while massive, may still be biased or limited in its coverage of the real-world diversity of video content. The authors suggest that further research is needed to address these biases and ensure the model's robustness across a wider range of scenarios.

Another potential concern is the significant computational and energy resources required to train and deploy such a large-scale model. The environmental impact of these foundation models is an important consideration that warrants further investigation.

Despite these caveats, the paper presents a compelling case for the value of scaling up video foundation models. The impressive performance gains and versatility demonstrated by InternVideo2 suggest that this line of research could lead to transformative advancements in our ability to understand and interact with video content.

Conclusion

The paper introduces InternVideo2, a state-of-the-art video foundation model that has been trained on an unprecedented 260 million video-text pairs. By leveraging this massive dataset and a powerful transformer-based architecture, the model achieves impressive performance on a wide range of video understanding tasks.

This work highlights the potential of scaling up video foundation models to unlock new capabilities in areas like assistive technology, creative media, and entertainment. While the model has some limitations, the authors' insights and the broader trend towards larger and more capable multimodal AI systems suggest that the field of video understanding is poised for rapid advancements in the years to come.



Related Papers

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies the masked video modeling, crossmodal contrastive learning, and next token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.

8/15/2024

Foundation Models for Video Understanding: A Survey

Neelu Madan, Andreas Moegelmose, Rajat Modi, Yogesh S. Rawat, Thomas B. Moeslund

Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs achieve this by capturing robust and generic features from video data. This survey analyzes over 200 video foundational models, offering a comprehensive overview of benchmarks and evaluation metrics across 14 distinct video tasks categorized into 3 main categories. Additionally, we offer an in-depth performance analysis of these models for the 6 most common video tasks. We categorize ViFMs into three categories: 1) Image-based ViFMs, which adapt existing image models for video tasks, 2) Video-Based ViFMs, which utilize video-specific encoding methods, and 3) Universal Foundational Models (UFMs), which combine multiple modalities (image, video, audio, and text etc.) within a single framework. By comparing the performance of various ViFMs on different tasks, this survey offers valuable insights into their strengths and weaknesses, guiding future advancements in video understanding. Our analysis surprisingly reveals that image-based foundation models consistently outperform video-based models on most video understanding tasks. Additionally, UFMs, which leverage diverse modalities, demonstrate superior performance on video tasks. We share the comprehensive list of ViFMs studied in this work at: https://github.com/NeeluMadan/ViFM_Survey.git

5/8/2024

TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Hyeongmin Lee, Jin-Young Kim, Kyungjune Baek, Jihwan Kim, Hyojun Go, Seongsu Ha, Seokjin Han, Jiho Jang, Raehyuk Jung, Daewoo Kim, GeunOh Kim, JongMok Kim, Jongseok Kim, Junwan Kim, Soonwoo Kwon, Jangwon Lee, Seungjoon Park, Minjoon Seo, Jay Suh, Jaehyuk Yi, Aiden Lee

In this work, we discuss evaluating video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, pretraining steps, etc.), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. Based on the average top-1 accuracy of linear probing on five action recognition benchmarks, pretrained only on publicly accessible datasets, our model shows a 4.6%p improvement compared to V-JEPA (ViT-L) and a 7.7%p improvement compared to UMT (ViT-L). Even when compared to much larger models, our model demonstrates a 7.2%p improvement compared to DFN (ViT-H), a 2.7%p improvement compared to V-JEPA (ViT-H) and a 2.8%p improvement compared to InternVideo2 (ViT-g). We provide embedding vectors obtained by TWLV-I from videos of several commonly used video benchmarks, along with evaluation source code that can directly utilize these embeddings. The code is available at https://github.com/twelvelabs-io/video-embeddings-evaluation-framework.

8/26/2024

VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model

Xinhao Li, Zhenpeng Huang, Jing Wang, Kunchang Li, Limin Wang

With the growth of high-quality data and advancement in visual pre-training paradigms, Video Foundation Models (VFMs) have made significant progress recently, demonstrating their remarkable performance on traditional video understanding benchmarks. However, the existing benchmarks (e.g. Kinetics) and their evaluation protocols are often limited by relatively poor diversity, high evaluation costs, and saturated performance metrics. In this paper, we build a comprehensive benchmark suite to address these issues, namely VideoEval. Specifically, we establish the Video Task Adaption Benchmark (VidTAB) and the Video Embedding Benchmark (VidEB) from two perspectives: evaluating the task adaptability of VFMs under few-shot conditions and assessing their representation power by directly applying to downstream tasks. With VideoEval, we conduct a large-scale study on 20 popular open-source vision foundation models. Our study reveals some insightful findings on VFMs: 1) overall, current VFMs exhibit weak generalization across diverse tasks, 2) increasing video data, whether labeled or weakly-labeled video-text pairs, does not necessarily improve task performance, 3) the effectiveness of some pre-training paradigms may not be fully validated in previous benchmarks, and 4) combining different pre-training paradigms can help improve the generalization capabilities. We believe this study serves as an important complement to the current evaluation for VFMs and offers valuable insights for the future research.

7/10/2024