Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding

2406.10221

Published 6/17/2024 by Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, Ivan Laptev

Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding

Abstract

Recent advances in vision-language models have significantly propelled video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often document the activities of one person in a single scene. Although some movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos and frequently encounter data leakage given the use of movie forums and other resources in LLM training. To address the above limitations, we propose the Short Film Dataset (SFD) with 1,078 publicly available amateur movies, a wide variety of genres and minimal data leakage issues. SFD offers long-term story-oriented video tasks in the form of multiple-choice and open-ended question answering. Our extensive experiments emphasize the need for long-term reasoning to solve SFD tasks. Notably, we find strong signals in movie transcripts leading to the on-par performance of people and LLMs. We also show significantly lower performance of current models compared to people when using vision data alone.

Create account to get full access

Overview

This paper introduces the Short Film Dataset (SFD), a new benchmark for story-level video understanding.
SFD consists of over 10,000 short films, providing a rich and diverse set of examples for training and evaluating video understanding models.
The dataset covers a wide range of narrative styles, genres, and production quality, making it a challenging and realistic test of a model's ability to understand the complex storylines and visual information in real-world videos.

Plain English Explanation

The researchers have created a new dataset called the Short Film Dataset (SFD) that can be used to test how well AI models can understand the stories and visuals in short videos. [This dataset can be useful for improving video understanding models, as discussed in related works like the <a href="https://aimodels.fyi/papers/arxiv/cinepile-long-video-question-answering-dataset-benchmark">CinePile dataset</a> and the <a href="https://aimodels.fyi/papers/arxiv/survey-video-datasets-grounded-event-understanding">survey of video datasets for grounded event understanding</a>.]

The SFD dataset contains over 10,000 short films, which cover a wide variety of narrative styles, genres, and production quality. This diversity makes the dataset a challenging and realistic test of a model's ability to understand the complex storylines and visual information in real-world videos. [Other benchmarks like <a href="https://aimodels.fyi/papers/arxiv/lvbench-extreme-long-video-understanding-benchmark">LVBench</a> and <a href="https://aimodels.fyi/papers/arxiv/youtube-sfvhdr-quality-dataset">YouTube-SFVHDR</a> also focus on advancing video understanding capabilities.]

By providing a large and diverse dataset of short films, the researchers hope to drive progress in developing AI systems that can truly understand the narrative and visual elements of videos, going beyond just identifying objects or recognizing actions.

Technical Explanation

The Short Film Dataset (SFD) is a new benchmark for story-level video understanding introduced in this paper. The dataset consists of over 10,000 short films from a variety of sources, covering a wide range of narrative styles, genres, and production quality.

The researchers carefully curated the dataset to ensure it provides a challenging and realistic test of a model's ability to understand the complex storylines and visual information present in real-world videos. This includes a diverse set of examples that vary in length, from a few minutes to over an hour, as well as a range of production values, from low-budget amateur films to high-quality professional productions.

To enable a wide range of video understanding tasks, the SFD dataset includes annotations for various aspects of the videos, such as shot-level descriptions, character information, and event-level narratives. This allows researchers to develop and evaluate models that can perform tasks like summarizing the plot, answering questions about the characters and their motivations, and tracking the progression of events throughout the video.

By providing this extensive and diverse dataset, the researchers aim to drive progress in the field of video understanding, going beyond the traditional focus on recognizing individual objects or actions. The SFD benchmark challenges models to truly comprehend the narrative and visual elements of videos, paving the way for more advanced applications in areas like movie recommendation, video editing, and interactive storytelling.

Critical Analysis

The Short Film Dataset (SFD) presents an important step forward in the field of video understanding, as it provides a more realistic and challenging benchmark compared to existing datasets. The diversity of the videos in the SFD, in terms of length, genre, and production quality, is a particularly valuable aspect, as it better reflects the complexity and variability of real-world video content.

However, the paper does acknowledge some limitations of the SFD. For example, the dataset is primarily focused on short films, which may not fully capture the narrative structures and visual characteristics of longer-form videos, such as feature-length movies or TV shows. [Further research could explore extending the SFD or creating complementary datasets to address this gap, as discussed in the <a href="https://aimodels.fyi/papers/arxiv/fewer-tokens-fewer-videos-extending-video-understanding">work on extending video understanding</a>.]

Additionally, the paper notes that the annotations provided with the SFD, while comprehensive, may not fully capture all the nuances and subtleties of the narrative and visual elements in the videos. Developing more sophisticated annotation schemes or leveraging human-in-the-loop approaches could potentially enhance the dataset and make it an even more valuable resource for video understanding research.

Despite these limitations, the SFD represents a significant contribution to the field and should serve as a valuable benchmark for researchers working on advancing the state-of-the-art in video understanding. By providing a diverse and challenging dataset, the SFD encourages the development of more robust and sophisticated models that can better capture the rich storytelling and visual information present in real-world videos.

Conclusion

The Short Film Dataset (SFD) introduced in this paper is a valuable new benchmark for story-level video understanding. By providing a large and diverse collection of short films, the SFD challenges researchers to develop AI models that can truly comprehend the narrative and visual elements of videos, going beyond traditional approaches focused on object recognition or action classification.

The SFD's extensive annotations and coverage of a wide range of narrative styles, genres, and production quality make it a realistic and challenging test of a model's video understanding capabilities. While the dataset has some limitations, it represents an important step forward in the field and should spur further advancements in areas like movie recommendation, video editing, and interactive storytelling.

By leveraging the SFD and similar benchmarks, researchers can continue to push the boundaries of video understanding and create AI systems that can better capture the richness and complexity of real-world video content.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🛠️

CinePile: A Long Video Question Answering Dataset and Benchmark

Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein

Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance in these tasks, highlighting the complexity and challenge inherent in video understanding. The dataset is available at https://hf.co/datasets/tomg-group-umd/cinepile

6/17/2024

cs.CV cs.LG cs.MM

A Survey of Video Datasets for Grounded Event Understanding

Kate Sanders, Benjamin Van Durme

While existing video benchmarks largely consider specialized downstream tasks like retrieval or question-answering (QA), contemporary multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding. A critical component of human temporal-visual perception is our ability to identify and cognitively model things happening, or events. Historically, video benchmark tasks have implicitly tested for this ability (e.g., video captioning, in which models describe visual events with natural language), but they do not consider video event understanding as a task in itself. Recent work has begun to explore video analogues to textual event extraction but consists of competing task definitions and datasets limited to highly specific event types. Therefore, while there is a rich domain of event-centric video research spanning the past 10+ years, it is unclear how video event understanding should be framed and what resources we have to study it. In this paper, we survey 105 video datasets that require event understanding capability, consider how they contribute to the study of robust event understanding in video, and assess proposed video event extraction tasks in the context of this body of research. We propose suggestions informed by this survey for dataset curation and task framing, with an emphasis on the uniquely temporal nature of video events and ambiguity in visual content.

6/17/2024

cs.CV cs.AI

Multilingual Synopses of Movie Narratives: A Dataset for Story Understanding

Yidan Sun, Jianfei Yu, Boyang Li

Story video-text alignment, a core task in computational story understanding, aims to align video clips with corresponding sentences in their descriptions. However, progress on the task has been held back by the scarcity of manually annotated video-text correspondence and the heavy concentration on English narrations of Hollywood movies. To address these issues, in this paper, we construct a large-scale multilingual video story dataset named Multilingual Synopses of Movie Narratives (M-SYMON), containing 13,166 movie summary videos from 7 languages, as well as manual annotation of fine-grained video-text correspondences for 101.5 hours of video. Training on the human annotated data from SyMoN outperforms the SOTA methods by 15.7 and 16.2 percentage points on Clip Accuracy and Sentence IoU scores, respectively, demonstrating the effectiveness of the annotations. As benchmarks for future research, we create 6 baseline approaches with different multilingual training strategies, compare their performance in both intra-lingual and cross-lingual setups, exemplifying the challenges of multilingual video-text alignment.

6/21/2024

cs.CL

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang

Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension. Our data and code are publicly available at: https://lvbench.github.io.

6/13/2024

cs.CV cs.AI