Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

2403.11956

Published 5/21/2024 by Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, Ning Liu

cs.CV

Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

Abstract

With the rapid development of generative models, Artificial Intelligence-Generated Contents (AIGC) have exponentially increased in daily lives. Among them, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating high perceptual quality videos, there is still lack of a method to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models. We also conduct a subjective study to obtain each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from text-video alignment and video fidelity perspectives, then it leverages the ability of a large language model to give the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjective-align predictions, validating its effectiveness. The dataset and code will be released at https://github.com/QMME/T2VQA.

Create account to get full access

Overview

This paper introduces a new dataset and metric for evaluating the quality of text-to-video generation models.
The dataset, called TVQE, contains subjective video quality ratings from human participants for a diverse set of text-to-video samples.
The authors also propose a new metric, called TVQA, which aims to align with human perceptions of video quality.

Plain English Explanation

The paper focuses on the important task of evaluating the quality of text-to-video generation models. These models take text descriptions as input and generate corresponding video outputs. However, measuring the quality of these video outputs is challenging, as it involves subjective human perceptions.

To address this, the researchers created a new dataset called TVQE (link). This dataset contains a large collection of text-to-video samples, each of which has been rated by human participants on various aspects of video quality. By capturing these subjective ratings, the TVQE dataset provides a valuable resource for training and evaluating text-to-video models.

Additionally, the authors developed a new quality assessment metric called TVQA (link). This metric is designed to align with human perceptions of video quality, going beyond traditional technical metrics. By using the TVQE dataset, the TVQA metric can provide a more meaningful and reliable way to measure the performance of text-to-video generation models.

Technical Explanation

The TVQE dataset consists of over 10,000 text-to-video samples, each with corresponding subjective quality ratings from human participants. The text-to-video samples cover a diverse range of topics and visual styles, and the ratings cover various aspects of video quality, such as realism, coherence, and aesthetic appeal.

To develop the TVQA metric, the authors used the TVQE dataset to train a deep learning model that can predict video quality ratings. The model takes into account both the input text and the generated video, and it is trained to output a quality score that aligns with the subjective ratings in the TVQE dataset.

The authors evaluate the TVQA metric on several text-to-video generation benchmarks, including AIGIQA-20K, MTVQA, and the AIS 2024 Challenge. The results show that TVQA outperforms traditional metrics in terms of aligning with human perceptions of video quality.

Critical Analysis

While the TVQE dataset and TVQA metric represent an important contribution to the field of text-to-video generation, there are a few potential limitations to consider:

The dataset may not capture the full diversity of text-to-video samples, as it is limited to the specific samples included in the collection.
The subjective ratings in the TVQE dataset could be influenced by individual biases or preferences, which may not generalize to all users.
The TVQA metric, while designed to align with human perceptions, may still have limitations in capturing the nuances of video quality.

Further research could explore ways to expand the TVQE dataset, incorporate additional types of subjective feedback, and refine the TVQA metric to address these potential limitations.

Conclusion

This paper presents a novel dataset and metric for evaluating the quality of text-to-video generation models. The TVQE dataset provides a valuable resource for training and evaluating these models, while the TVQA metric offers a more reliable way to assess their performance in alignment with human perceptions of video quality.

By addressing the challenge of subjective video quality assessment, this research represents an important step forward in the development of advanced text-to-video generation systems. The insights and tools provided in this paper can inform future work in this rapidly evolving field, ultimately leading to more realistic and engaging text-to-video experiences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap

Bowen Qu, Xiaoyu Liang, Shangkun Sun, Wei Gao

The recent advancements in Text-to-Video Artificial Intelligence Generated Content (AIGC) have been remarkable. Compared with traditional videos, the assessment of AIGC videos encounters various challenges: visual inconsistency that defy common sense, discrepancies between content and the textual prompt, and distribution gap between various generative models, etc. Target at these challenges, in this work, we categorize the assessment of AIGC video quality into three dimensions: visual harmony, video-text consistency, and domain distribution gap. For each dimension, we design specific modules to provide a comprehensive quality assessment of AIGC videos. Furthermore, our research identifies significant variations in visual quality, fluidity, and style among videos generated by different text-to-video models. Predicting the source generative model can make the AIGC video features more discriminative, which enhances the quality assessment performance. The proposed method was used in the third-place winner of the NTIRE 2024 Quality Assessment for AI-Generated Content - Track 2 Video, demonstrating its effectiveness. Code will be available at https://github.com/Coobiw/TriVQA.

4/30/2024

cs.CV

TAVGBench: Benchmarking Text to Audible-Video Generation

Yuxin Mao, Xuyang Shen, Jing Zhang, Zhen Qin, Jinxing Zhou, Mochu Xiang, Yiran Zhong, Yuchao Dai

The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio based on text descriptions. Achieving this requires skillful alignment of both audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure each audible video has detailed descriptions for both its audio and video contents. We also introduce the Audio-Visual Harmoni score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model to provide a fundamental starting point for further research in this area. We achieve the alignment of audio and video by employing cross-attention and contrastive learning. Through extensive experiments and evaluations on TAVGBench, we demonstrate the effectiveness of our proposed model under both conventional metrics and our proposed metrics.

4/23/2024

cs.CV cs.MM

🛸

Evaluating Text-to-Visual Generation with Image-to-Text Generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan

Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a bag of words, conflating prompts such as the horse is eating the grass with the grass is eating the horse. To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a Yes answer to a simple Does this figure show '{text}'? question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.

6/19/2024

cs.CV cs.AI cs.CL cs.LG cs.MM

🤿

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability,Reproducibility, and Practicality

Tianle Zhang, Langtian Ma, Yuchen Yan, Yuchen Zhang, Kai Wang, Yue Yang, Ziyao Guo, Wenqi Shao, Yang You, Yu Qiao, Ping Luo, Kaipeng Zhang

Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened its applicability and popularity. Despite these strides, evaluating these models poses substantial challenges. Primarily, due to the limitations inherent in automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. However, existing manual evaluation protocols face reproducibility, reliability, and practicality issues. To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models. The T2VHE protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module. Experimental results demonstrate that this protocol not only ensures high-quality annotations but can also reduce evaluation costs by nearly 50%. We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code. This will help communities establish more sophisticated human assessment protocols.

6/14/2024

cs.CV