LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

Read original: arXiv:2408.14008 - Published 8/27/2024 by Qihang Ge, Wei Sun, Yu Zhang, Yunhao Li, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai

Overview

This paper introduces LMM-VQA, a large multimodal model for video quality assessment (VQA).
LMM-VQA recasts quality regression as a question-answering task, so that a large language model can generate a quality score and level for each video.
The model is trained on video clips paired with their quality scores, and a spatiotemporal vision encoder extracts the spatial and temporal features that characterize video quality.
According to the abstract, LMM-VQA achieves state-of-the-art performance on five VQA benchmarks, with an average improvement of 5% in generalization ability over existing methods.

Plain English Explanation

Video quality is an important factor in many applications, from streaming services to video conferencing. Traditionally, video quality assessment (VQA) models have relied on hand-crafted features or simple machine learning algorithms to evaluate video quality. However, these approaches often fail to capture the nuanced and complex aspects that contribute to a viewer's perception of video quality.

The researchers behind this paper have developed a new model called LMM-VQA that takes a different approach. LMM-VQA is based on large multimodal models, which are AI systems that can process more than one type of data together, in this case video and text. By training LMM-VQA on video clips and their corresponding quality scores, the model learns to recognize the spatial and temporal distortions that shape a viewer's overall perception of video quality.

Through experiments, the researchers show that LMM-VQA outperforms previous state-of-the-art VQA models on five benchmark datasets. This suggests that large multimodal models can improve the accuracy and generalization of video quality assessment, with potential applications in areas like video streaming, content creation, and video-based communication.

Technical Explanation

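The core of LMM-VQA is a large multimodal architecture that processes a video together with a text prompt. A spatiotemporal vision encoder extracts spatial and temporal features that represent the quality characteristics of the video, and a spatiotemporal projector maps those features into the language space. The aligned visual tokens and the quality-question text tokens are then passed to a large language model (LLM). The model is trained on video clips and their associated quality scores, which are obtained through subjective human evaluations.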
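The abstract quoted under Related Papers says that spatial and temporal features are mapped into the language space by a projector and then aggregated with the text tokens as LLM input. The following is a minimal sketch of that data flow, not the authors' implementation. It assumes a plain linear projection with random weights and simple concatenation, and every name and dimension in it (`make_projector`, `project`, `build_llm_input`, `LLM_DIM`) is hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def make_projector(in_dim: int, out_dim: int, seed: int) -> Matrix:
    """Random (out_dim x in_dim) matrix standing in for learned projector weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features: List[Vector], weights: Matrix) -> List[Vector]:
    """Apply the same linear map to every feature vector (one output token each)."""
    in_dim = len(weights[0])
    projected = []
    for feat in features:
        if len(feat) != in_dim:
            raise ValueError("feature size does not match projector input size")
        projected.append([sum(w * x for w, x in zip(row, feat)) for row in weights])
    return projected


def build_llm_input(
    spatial: List[Vector],
    temporal: List[Vector],
    text_embeddings: List[Vector],
    spatial_proj: Matrix,
    temporal_proj: Matrix,
) -> List[Vector]:
    """Project both visual streams to the LLM width, then prepend them to the text."""
    visual_tokens = project(spatial, spatial_proj) + project(temporal, temporal_proj)
    return visual_tokens + text_embeddings


# Toy sizes, chosen only for illustration (assumptions, not values from the paper).
LLM_DIM = 6
spatial_feats = [[0.1] * 8 for _ in range(4)]       # 4 key frames, 8-dim features
temporal_feats = [[0.2] * 5 for _ in range(3)]      # 3 clips, 5-dim features
text_embeds = [[0.0] * LLM_DIM for _ in range(7)]   # 7 prompt tokens

sequence = build_llm_input(
    spatial_feats,
    temporal_feats,
    text_embeds,
    make_projector(8, LLM_DIM, seed=0),
    make_projector(5, LLM_DIM, seed=1),
)
```

The point of the sketch is only the shape bookkeeping: the two visual streams may have different feature sizes, but after projection every token has the same width as the text embeddings, so the LLM can consume them as one sequence.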

During training, the quality regression problem is reformulated as a question-answering (Q&A) task: Q&A prompts are constructed for instruction tuning, and the LLM learns to generate a quality score and a quality level for a given video clip. In doing so, the model learns how spatial factors such as visual clarity and temporal factors such as motion contribute to a viewer's perception of video quality.
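The abstract describes turning quality regression into a Q&A task whose answer contains a quality score and level. A minimal sketch of how one training pair could be built from a mean opinion score is shown here. The prompt wording, the `<video>` placeholder, the 1 to 5 score range, and the five level names (borrowed from the common absolute category rating scale) are all assumptions for illustration, not the paper's actual templates.

```python
from dataclasses import dataclass

# Hypothetical five-level scale; the source does not say which levels LMM-VQA uses.
LEVELS = ("bad", "poor", "fair", "good", "excellent")


@dataclass
class QAPair:
    question: str
    answer: str


def mos_to_level(mos: float, lo: float = 1.0, hi: float = 5.0) -> str:
    """Map a score in [lo, hi] to one of five equal-width bins."""
    if not lo <= mos <= hi:
        raise ValueError(f"MOS {mos} outside [{lo}, {hi}]")
    frac = (mos - lo) / (hi - lo)
    # The top edge (mos == hi) would index past the end, so clamp it.
    index = min(int(frac * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[index]


def build_qa_pair(mos: float) -> QAPair:
    """Build one instruction-tuning example: a fixed question and a templated answer."""
    question = "<video> How would you rate the quality of this video?"
    answer = f"The quality of this video is {mos_to_level(mos)}. Score: {mos:.2f}."
    return QAPair(question, answer)


example = build_qa_pair(3.6)
```

Putting both the level and the numeric score in the answer mirrors the abstract's statement that the LLM generates "the quality score and level"; how the two are actually formatted in the paper is not stated in this summary.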

The researchers experiment with various model configurations and training strategies to optimize the performance of LMM-VQA. This includes exploring different ways of fusing the multimodal inputs, as well as techniques for adapting the model to specific video quality assessment tasks and datasets.

According to the abstract, the experiments show state-of-the-art performance across five VQA benchmarks, with an average improvement of 5% in generalization ability over existing methods. The abstract also reports strong results on general video understanding tasks, which the authors attribute to the design of the spatiotemporal encoder and projector.
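This summary does not say which metrics the benchmarks use. VQA models are commonly compared by how well predicted scores correlate with human scores, using Spearman rank correlation (SRCC) and Pearson linear correlation (PLCC). The sketch below, assuming those two metrics, computes both with the standard library only; the sample numbers are made up for illustration.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up predictions and human scores, for illustration only.
predicted = [1.0, 2.0, 3.0, 4.0]
human = [1.0, 4.0, 9.0, 16.0]
srcc = spearman(predicted, human)
plcc = pearson(predicted, human)
```

In this toy case the predictions order the videos perfectly, so SRCC is 1, while PLCC is lower because the relationship is not linear. That difference is why the two metrics are usually reported together.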

Critical Analysis

One of the key strengths of LMM-VQA is that it combines quality-aware spatial and temporal visual features with the language ability of an LLM to produce both a quality score and a quality level. This is an advance over traditional VQA approaches, which often rely on hand-crafted features or simple machine learning algorithms that struggle to capture the diverse content and complex distortions that affect video quality.

However, the paper also acknowledges some limitations of the LMM-VQA model, such as the need for a large and diverse dataset of video clips and their corresponding quality scores to train the model effectively. Additionally, the computational and storage requirements of the large multimodal architecture may pose challenges for deployment in certain real-world applications.

Further research could explore ways to make the model more efficient, and could test whether transfer learning or few-shot learning reduces the amount of data needed to train LMM-VQA. Work on the interpretability of the model's decisions could also show which specific factors drive perceived video quality.

Conclusion

The LMM-VQA model introduced in this paper represents a significant advancement in the field of video quality assessment. By leveraging the power of large multimodal models, the researchers have developed a system that can more accurately and comprehensively evaluate the quality of video content, with potential applications in areas like video streaming, content creation, and video-based communication.

The promising results of the experiments suggest that the use of large multimodal models could be a fruitful direction for further research and development in the field of video quality assessment. As the capabilities of these models continue to evolve, we can expect to see even more accurate and robust video quality evaluation tools that can better serve the needs of both content creators and consumers.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

Qihang Ge, Wei Sun, Yu Zhang, Yunhao Li, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai

The explosive growth of videos on streaming media platforms has underscored the urgent need for effective video quality assessment (VQA) algorithms to monitor and perceptually optimize the quality of streaming videos. However, VQA remains an extremely challenging task due to the diverse video content and the complex spatial and temporal distortions, thus necessitating more advanced methods to address these issues. Nowadays, large multimodal models (LMMs), such as GPT-4V, have exhibited strong capabilities for various visual understanding tasks, motivating us to leverage the powerful multimodal representation ability of LMMs to solve the VQA task. Therefore, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel spatiotemporal visual modeling strategy for quality-aware feature extraction. Specifically, we first reformulate the quality regression problem into a question and answering (Q&A) task and construct Q&A prompts for VQA instruction tuning. Then, we design a spatiotemporal vision encoder to extract spatial and temporal features to represent the quality characteristics of videos, which are subsequently mapped into the language space by the spatiotemporal projector for modality alignment. Finally, the aligned visual tokens and the quality-inquired text tokens are aggregated as inputs for the large language model (LLM) to generate the quality score and level. Extensive experiments demonstrate that LMM-VQA achieves state-of-the-art performance across five VQA benchmarks, exhibiting an average improvement of 5% in generalization ability over existing methods. Furthermore, due to the advanced design of the spatiotemporal encoder and projector, LMM-VQA also performs exceptionally well on general video understanding tasks, further validating its effectiveness. Our code will be released at https://github.com/Sueqk/LMM-VQA.

8/27/2024

Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

Lu Xu, Sijie Zhu, Chunyuan Li, Chia-Wen Kuo, Fan Chen, Xinyao Wang, Guang Chen, Dawei Du, Ye Yuan, Longyin Wen

The emerging video LMMs (Large Multimodal Models) have achieved significant improvements on generic video understanding in the form of VQA (Visual Question Answering), where the raw videos are captured by cameras. However, a large portion of videos in real-world applications are edited videos, e.g., users usually cut and add effects/modifications to the raw video before publishing it on social media platforms. The edited videos usually have high view counts but they are not covered in existing benchmarks of video LMMs, i.e., ActivityNet-QA, or VideoChatGPT benchmark. In this paper, we leverage the edited videos on a popular short video platform, i.e., TikTok, and build a video VQA benchmark (named EditVid-QA) covering four typical editing categories, i.e., effect, funny, meme, and game. Funny and meme videos benchmark nuanced understanding and high-level reasoning, while effect and game evaluate the understanding capability of artificial design. Most of the open-source video LMMs perform poorly on the EditVid-QA benchmark, indicating a huge domain gap between edited short videos on social media and regular raw videos. To improve the generalization ability of LMMs, we collect a training set for the proposed benchmark based on both Panda-70M/WebVid raw videos and small-scale TikTok/CapCut edited videos, which boosts the performance on the proposed EditVid-QA benchmark, indicating the effectiveness of high-quality training data. We also identified a serious issue in the existing evaluation protocol using the GPT-3.5 judge, namely a sorry attack, where a sorry-style naive answer can achieve an extremely high rating from the GPT judge, e.g., over 4.3 for correctness score on VideoChatGPT evaluation protocol. To avoid the sorry attacks, we evaluate results with GPT-4 judge and keyword filtering. The datasets will be released for academic purposes only.

6/18/2024

Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering

Haibo Wang, Chenghang Lai, Yixuan Sun, Weifeng Ge

Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos. Despite the recent success of Large Multimodal Models (LMMs) in image-language understanding and reasoning, they deal with VideoQA insufficiently, by simply taking uniformly sampled frames as visual inputs, which ignores question-relevant visual clues. Moreover, there are no human annotations for question-critical timestamps in existing VideoQA datasets. In light of this, we propose a novel weakly supervised framework to enforce the LMMs to reason out the answers with question-critical moments as visual inputs. Specifically, we first fuse the question and answer pairs as event descriptions to find multiple keyframes as target moments and pseudo-labels, with the visual-language alignment capability of the CLIP models. With these pseudo-labeled keyframes as additionally weak supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG learns multiple Gaussian functions to characterize the temporal structure of the video, and samples question-critical frames as positive moments to be the visual inputs of LMMs. Extensive experiments on several benchmarks verify the effectiveness of our framework, and we achieve substantial improvements compared to previous state-of-the-art methods.

7/24/2024

PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild

Kun Yuan, Hongbo Liu, Mading Li, Muyi Sun, Ming Sun, Jiachao Gong, Jinhua Hao, Chao Zhou, Yansong Tang

Video quality assessment (VQA) is a challenging problem due to the numerous factors that can affect the perceptual quality of a video, e.g., content attractiveness, distortion type, motion pattern, and level. However, annotating the mean opinion score (MOS) for videos is expensive and time-consuming, which limits the scale of VQA datasets, and poses a significant obstacle for deep learning-based methods. In this paper, we propose a VQA method named PTM-VQA, which leverages PreTrained Models to transfer knowledge from models pretrained on various pre-tasks, enabling benefits for VQA from different aspects. Specifically, we extract features of videos from different pretrained models with frozen weights and integrate them to generate representation. Since these models possess various fields of knowledge and are often trained with labels irrelevant to quality, we propose an Intra-Consistency and Inter-Divisibility (ICID) loss to impose constraints on features extracted by multiple pretrained models. The intra-consistency constraint ensures that features extracted by different pretrained models are in the same unified quality-aware latent space, while the inter-divisibility introduces pseudo clusters based on the annotation of samples and tries to separate features of samples from different clusters. Furthermore, with a constantly growing number of pretrained models, it is crucial to determine which models to use and how to use them. To address this problem, we propose an efficient scheme to select suitable candidates. Models with better clustering performance on VQA datasets are chosen to be our candidates. Extensive experiments demonstrate the effectiveness of the proposed method.

5/29/2024