MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Read original: arXiv:2407.06358 - Published 7/10/2024 by Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan

Overview

The paper presents MiraData, a large-scale video dataset with long durations and structured captions.
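According to the paper's abstract, MiraData surpasses previous datasets in video duration, caption detail, motion strength, and visual quality, whereas existing public datasets mainly contain short videos with low motion intensity and brief captions.
The videos are accompanied by structured captions annotated with GPT-4V, which describe each clip from four different perspectives and add a summarized dense caption.
The dataset is designed to support video generation, in particular long, high-motion, Sora-like videos. The paper also introduces MiraBench, a benchmark for assessing temporal consistency and motion intensity in generated video.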

Plain English Explanation

The researchers have created a new video dataset called MiraData whose clips are longer and more thoroughly described than those in earlier public datasets. Most existing datasets contain short videos with little motion and brief captions. MiraData instead targets long clips with strong motion and high visual quality, which lets it capture more complex and realistic content.

In addition to longer videos, MiraData provides a detailed caption for each clip. These captions go beyond a one-line description: each is structured, describing the clip from four different perspectives, and comes with a summarized dense caption. The captions were annotated with GPT-4V. For example, a caption might describe the people, their activities, the environment, and any notable events or interactions.

The goal of MiraData is to enable better research in video generation, where a model produces video from a text description. The authors are motivated by Sora, whose long, consistent, high-motion videos cannot be reproduced with the existing publicly available datasets.

Technical Explanation

MiraData is curated from diverse, manually selected sources, and the data is processed to obtain semantically consistent clips. The authors report that it surpasses previous datasets in video duration, caption detail, motion strength, and visual quality. Each clip is accompanied by a detailed, structured caption.

The captions are annotated with GPT-4V. Each structured caption gives detailed descriptions from four different perspectives, together with a summarized dense caption, so it covers specific details of the clip as well as a general description of its content.
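
The paper's abstract describes each structured caption as detailed descriptions from four different perspectives plus a summarized dense caption. A minimal sketch of how such a record might be represented is shown here. The class name, field names, and placeholder perspective keys are all hypothetical, since the abstract neither names the four perspectives nor specifies a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StructuredCaption:
    """One clip's caption: a dense summary plus per-perspective descriptions."""

    clip_id: str
    dense_caption: str
    # Perspective name -> description. The abstract says there are four
    # perspectives but does not name them, so the keys here are placeholders.
    perspectives: Dict[str, str] = field(default_factory=dict)

    def word_count(self) -> int:
        """Total whitespace-separated words across all caption fields."""
        texts = [self.dense_caption, *self.perspectives.values()]
        return sum(len(text.split()) for text in texts)

    def to_prompt(self) -> str:
        """Flatten into one string: dense caption first, then each perspective."""
        parts = [self.dense_caption]
        parts += [f"{name}: {text}" for name, text in self.perspectives.items()]
        return " ".join(parts)


# Illustrative record; the caption text is invented for this example.
example = StructuredCaption(
    clip_id="clip_0001",
    dense_caption="A cyclist rides along a coastal road at sunset.",
    perspectives={
        "perspective_1": "A cyclist in a red jacket.",
        "perspective_2": "A cliffside road above the sea.",
        "perspective_3": "The view follows the rider from behind.",
        "perspective_4": "Warm, cinematic lighting.",
    },
)
```

Keeping the perspectives in a mapping rather than as fixed attributes means the same record type still works if the actual perspective names differ from the placeholders used here.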

To evaluate the dataset, the researchers introduce MiraBench, which extends existing benchmarks with 3D consistency and tracking-based motion strength metrics. MiraBench includes 150 evaluation prompts and 17 metrics covering temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity.

The researchers then ran experiments with MiraDiT, their DiT-based video generation model. According to the abstract, the results on MiraBench demonstrate the superiority of MiraData, especially in motion strength.
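
The abstract states that MiraBench adds tracking-based motion strength metrics but does not give a formula. As a rough illustration of the general idea only, the sketch below scores a clip by the mean per-frame displacement of tracked points. The function name, the input layout, and the definition itself are assumptions made for this example, not MiraBench's actual metric.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Track = Sequence[Point]  # one tracked point's (x, y) position in each frame


def motion_strength(tracks: List[Track]) -> float:
    """Mean per-frame displacement in pixels, averaged over all tracks.

    Hypothetical definition for illustration; not MiraBench's formula.
    """
    per_track = []
    for track in tracks:
        if len(track) < 2:
            continue  # a single observation carries no motion information
        steps = [math.dist(a, b) for a, b in zip(track, track[1:])]
        per_track.append(sum(steps) / len(steps))
    return sum(per_track) / len(per_track) if per_track else 0.0


# Two points that never move across four frames.
static_clip = [[(10.0, 10.0)] * 4, [(50.0, 20.0)] * 4]

# One point moving 5 px per frame and one moving 1 px per frame.
moving_clip = [
    [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],
    [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)],
]
```

Under this definition `static_clip` scores 0.0 and `moving_clip` scores 3.0, the average of its two per-track speeds. A real metric would also need a point tracker to produce the tracks and some normalization for resolution and frame rate, both of which are left out here.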

Critical Analysis

MiraData addresses a concrete gap: publicly available video datasets mainly contain short, low-motion clips with brief captions. Its long durations and detailed, structured captions are the features most likely to enable new research directions in video generation.

One potential limitation of the dataset is the diversity of its content. Because the sources are manually selected, the dataset may not represent the full breadth of video available on the internet or in real-world applications. Additionally, because the captions are annotated with GPT-4V, they may inherit that model's biases or inconsistencies.

Further research is needed to fully understand the strengths and limitations of the MiraData dataset. For example, it would be interesting to explore how models trained on MiraData perform on out-of-domain video data or how the dataset can be used to study the relationship between video content and language.

Overall, MiraData is a valuable resource with the potential to advance research in video generation. By pairing long videos with rich, structured captions, it opens up new avenues for work at the intersection of computer vision and natural language processing.

Conclusion

MiraData is a video dataset that, according to its authors, surpasses previous ones in video duration, caption detail, motion strength, and visual quality. Together with the MiraBench benchmark, it is aimed at research on generating long, high-motion videos.

The long durations and rich captions in MiraData allow for the study of more complex and realistic video content, going beyond the short clips that have dominated existing video datasets. The structured captions, which describe each clip from four different perspectives alongside a summarized dense caption, can also support the development of more sophisticated video generation models.

While the dataset may have some limitations, such as potential biases in the GPT-4V annotation or a lack of diversity in the video content, it represents a step forward for work that combines computer vision and natural language processing. It gives researchers a basis for further exploration, with the potential to drive progress in areas like text-video alignment and multimodal understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan

Sora's high-motion intensity and long consistent videos have significantly impacted the field of video generation, attracting unprecedented attention. However, existing publicly available datasets are inadequate for generating Sora-like videos, as they mainly contain short videos with low motion intensity and brief captions. To address these issues, we propose MiraData, a high-quality video dataset that surpasses previous ones in video duration, caption detail, motion strength, and visual quality. We curate MiraData from diverse, manually selected sources and meticulously process the data to obtain semantically consistent clips. GPT-4V is employed to annotate structured captions, providing detailed descriptions from four different perspectives along with a summarized dense caption. To better assess temporal consistency and motion intensity in video generation, we introduce MiraBench, which enhances existing benchmarks by adding 3D consistency and tracking-based motion strength metrics. MiraBench includes 150 evaluation prompts and 17 metrics covering temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity. To demonstrate the utility and effectiveness of MiraData, we conduct experiments using our DiT-based video generation model, MiraDiT. The experimental results on MiraBench demonstrate the superiority of MiraData, especially in motion strength.

7/10/2024

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, Ying Tai

Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) Lacking a precise open sourced high-quality dataset. The previous popular video datasets, e.g. WebVid-10M and Panda-70M, are either with low quality or too large for most research institutions. Therefore, it is challenging but crucial to collect a precise high-quality text-video pairs for T2V generation. 2) Ignoring to fully utilize textual information. Recent T2V methods have focused on vision transformers, using a simple cross attention module for video generation, which falls short of thoroughly extracting semantic information from text prompt. To address these issues, we introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.

8/6/2024

Vript: A Video Is Worth Thousands of Words

Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, Hai Zhao

Advancements in multimodal learning, particularly in video understanding and generation, require high-quality video-text datasets for improved model performance. Vript addresses this issue with a meticulously annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips. Each clip has a caption of ~145 words, which is over 10x longer than most video-text datasets. Unlike captions only documenting static content in previous datasets, we enhance video captioning to video scripting by documenting not just the content, but also the camera operations, which include the shot types (medium shot, close-up, etc) and camera movements (panning, tilting, etc). By utilizing the Vript, we explore three training paradigms of aligning more text with the video modality rather than clip-caption pairs. This results in Vriptor, a top-performing video captioning model among open-source models, comparable to GPT-4V in performance. Vriptor is also a powerful model capable of end-to-end generation of dense and detailed captions for long videos. Moreover, we introduce Vript-Hard, a benchmark consisting of three video understanding tasks that are more challenging than existing benchmarks: Vript-HAL is the first benchmark evaluating action and object hallucinations in video LLMs, Vript-RR combines reasoning with retrieval resolving question ambiguity in long-video QAs, and Vript-ERO is a new task to evaluate the temporal understanding of events in long videos rather than actions in short videos in previous works. All code, models, and datasets are available in https://github.com/mutonix/Vript.

6/11/2024

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, Min Zhang

Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.

6/18/2024