TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

2406.08656

Published 6/14/2024 by Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, William Yang Wang

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

Abstract

Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.

Create account to get full access

Overview

• This paper introduces TC-Bench, a benchmark for evaluating the temporal compositionality of text-to-video and image-to-video generation models. • Temporal compositionality refers to a model's ability to generate videos that seamlessly combine and transition between different visual elements over time based on the input text or image. • TC-Bench includes a diverse dataset of video clips and associated text/image prompts that test various aspects of temporal compositionality.

Plain English Explanation

The paper presents a new benchmark called TC-Bench to evaluate how well AI models can generate videos that smoothly combine and transition between different visual elements over time based on text or image inputs. Temporal compositionality is an important capability for text-to-video and image-to-video generation models, as it allows them to create videos where the visuals logically flow and unfold over time in response to the input prompt.

VideoTetris: Towards Compositional Text-to-Video Generation and Leveraging Temporal Contextualization for Video Action Recognition have explored related ideas around compositional video generation. TC-Bench aims to provide a standardized way to test and compare models on this key capability.

The benchmark includes a dataset of video clips with associated text or image prompts that capture different aspects of temporal compositionality, such as transitioning between actions, maintaining object continuity, and depicting causal relationships. By evaluating models on this diverse set of test cases, researchers can gain insight into the strengths and weaknesses of different approaches to text-to-video and image-to-video generation.

Technical Explanation

The paper introduces TC-Bench, a benchmark for evaluating the temporal compositionality of text-to-video and image-to-video generation models. Temporal compositionality refers to a model's ability to generate videos that seamlessly combine and transition between different visual elements over time based on the input text or image.

TC-Bench includes a diverse dataset of video clips and associated text/image prompts designed to test various aspects of temporal compositionality, including:

Transitioning between actions or states
Maintaining object and character continuity over time
Depicting causal relationships and logical progressions
Incorporating fine-grained temporal details

The benchmark is intended to complement existing video generation evaluation metrics, which often focus on static image quality or overall video realism, by providing a more targeted assessment of a model's ability to handle the temporal dimension.

The paper also discusses several baseline experiments using state-of-the-art text-to-video and image-to-video generation models, highlighting their performance on the TC-Bench dataset. These results demonstrate the utility of the benchmark in identifying areas for improvement in temporal compositionality.

SWAP-Attention: Spatiotemporal Diffusions for Text-to-Video Generation and Understanding and Mitigating Compositional Issues in Text-to-Image Generation have explored related challenges in compositional image and video generation that inform the design of TC-Bench.

Critical Analysis

The TC-Bench benchmark addresses an important and underexplored aspect of video generation - the ability to create videos that seamlessly combine and transition between different visual elements over time. This is a crucial capability for practical applications of text-to-video and image-to-video generation, where the generated videos need to be coherent, logical, and temporally consistent.

One potential limitation of the benchmark is the reliance on human-annotated video clips and prompts, which may not fully capture the range of possible temporal compositionality challenges. There could be value in exploring procedurally generated test cases or incorporating more diverse and open-ended prompts.

Additionally, the paper does not delve deeply into the specific architectural or training approaches that may be most effective for addressing the temporal compositionality issues highlighted by TC-Bench. Further research into the underlying factors that enable strong temporal compositionality, and how they can be incorporated into video generation models, would be a valuable next step.

VideoPhys: Evaluating Physical Commonsense in Video Generation explores another important dimension of video generation that could be combined with the temporal compositionality focus of TC-Bench to provide a more holistic evaluation framework.

Conclusion

The TC-Bench benchmark proposed in this paper is a valuable contribution to the field of text-to-video and image-to-video generation. By focusing on the crucial aspect of temporal compositionality, it provides a more targeted and meaningful way to evaluate the capabilities of these models. The diverse test cases and baseline results presented in the paper demonstrate the utility of TC-Bench and highlight areas for further research and improvement.

As text-to-video and image-to-video generation models continue to advance, the ability to seamlessly combine and transition between visual elements over time will become increasingly important. TC-Bench offers a standardized way to measure and track progress in this key area, potentially leading to the development of more coherent, logical, and temporally consistent video generation systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🛸

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, Deva Ramanan

While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.

6/26/2024

cs.CV cs.AI cs.CL cs.LG cs.MM

VideoTetris: Towards Compositional Text-to-Video Generation

Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui

Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may face challenges when handling complex (long) video generation scenarios that involve multiple objects or dynamic changes in object numbers. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we propose spatio-temporal compositional diffusion to precisely follow complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing to enhance the training data regarding motion dynamics and prompt understanding, equipped with a new reference frame attention mechanism to improve the consistency of auto-regressive video generation. Extensive experiments demonstrate that our VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. Code is available at: https://github.com/YangLing0818/VideoTetris

6/7/2024

cs.CV

ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan

We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities, in a free-form text query. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model's capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity. Based on the ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions ensuring high physical pertinence and large metamorphic amplitude.

6/27/2024

cs.CV cs.CL

Leveraging Temporal Contextualization for Video Action Recognition

Minji Kim, Dongyoon Han, Taekyung Kim, Bohyung Han

Pretrained vision-language models have shown effectiveness in video understanding. However, recent studies have not sufficiently leveraged essential temporal information from videos, simply averaging frame-wise representations or referencing consecutive frames. We introduce Temporally Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding that effectively and efficiently leverages comprehensive video information. We propose Temporal Contextualization (TC), a novel layer-wise temporal information infusion mechanism for video that extracts core information from each frame, interconnects relevant information across the video to summarize into context tokens, and ultimately leverages the context tokens during the feature encoding process. Furthermore, our Video-conditional Prompting (VP) module manufactures context tokens to generate informative prompts in text modality. We conduct extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition to validate the superiority of our TC-CLIP. Ablation studies for TC and VP guarantee our design choices. Code is available at https://github.com/naver-ai/tc-clip

4/16/2024

cs.CV