PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild

2405.17765

Published 5/29/2024 by Kun Yuan, Hongbo Liu, Mading Li, Muyi Sun, Ming Sun, Jiachao Gong, Jinhua Hao, Chao Zhou, Yansong Tang

cs.CV

Abstract

Video quality assessment (VQA) is a challenging problem due to the numerous factors that can affect the perceptual quality of a video, e.g., content attractiveness, distortion type, motion pattern, and level. However, annotating the mean opinion score (MOS) for videos is expensive and time-consuming, which limits the scale of VQA datasets, and poses a significant obstacle for deep learning-based methods. In this paper, we propose a VQA method named PTM-VQA, which leverages PreTrained Models to transfer knowledge from models pretrained on various pre-tasks, enabling benefits for VQA from different aspects. Specifically, we extract features of videos from different pretrained models with frozen weights and integrate them to generate representation. Since these models possess various fields of knowledge and are often trained with labels irrelevant to quality, we propose an Intra-Consistency and Inter-Divisibility (ICID) loss to impose constraints on features extracted by multiple pretrained models. The intra-consistency constraint ensures that features extracted by different pretrained models are in the same unified quality-aware latent space, while the inter-divisibility introduces pseudo clusters based on the annotation of samples and tries to separate features of samples from different clusters. Furthermore, with a constantly growing number of pretrained models, it is crucial to determine which models to use and how to use them. To address this problem, we propose an efficient scheme to select suitable candidates. Models with better clustering performance on VQA datasets are chosen to be our candidates. Extensive experiments demonstrate the effectiveness of the proposed method.

Overview

This paper proposes PTM-VQA, a video quality assessment method that transfers knowledge from diverse models pre-trained on various pre-tasks.
The key idea is to extract video features from several pre-trained models with frozen weights, integrate them into one representation, and constrain them with an Intra-Consistency and Inter-Divisibility (ICID) loss so that they become quality-aware.
This approach aims to improve the performance and efficiency of video quality assessment compared to existing methods.

Plain English Explanation

The paper is about a new way to evaluate the quality of videos. Collecting human quality ratings (mean opinion scores) for videos is expensive and time-consuming, so VQA datasets stay small, which makes it hard to train deep models on them. The researchers propose PTM-VQA, which instead reuses a diverse set of models that were already pre-trained on other tasks.

The key insight is that these pre-trained models, which have been trained on large datasets for various tasks, can capture useful visual, semantic, and temporal features that are relevant for predicting video quality. By combining the outputs of these diverse models, the researchers can make accurate quality assessments without needing to train a video-specific model from scratch.

This approach aims to be more efficient and practical than training a large quality model end to end, because the pre-trained models are kept frozen. The researchers report that extensive experiments demonstrate the effectiveness of PTM-VQA.

Technical Explanation

The paper introduces PTM-VQA, a video quality assessment method that draws on a diverse set of pre-trained models. The key idea is to reuse the feature representations these models learned on large datasets for other tasks: their weights stay frozen, and the extracted features are integrated to predict video quality.
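The abstract describes the ICID loss only in words, so the following is a minimal sketch of one way to read it, not the paper's formulation. The equal-width MOS binning, the squared-distance consistency term, the hinge margin on centroid distances, and every function name are assumptions made for illustration.

```python
import math

def pseudo_cluster(mos, num_bins=3, lo=1.0, hi=5.0):
    """Assign a pseudo cluster by binning the MOS into equal-width bins (assumed scheme)."""
    index = int((mos - lo) / (hi - lo) * num_bins)
    return min(max(index, 0), num_bins - 1)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def intra_consistency(per_model_feats):
    """Mean squared distance of each model's feature to the per-sample mean
    across models. It is 0 when all models map a sample to the same point."""
    total, count = 0.0, 0
    for sample_feats in zip(*per_model_feats):  # one sample, seen by every model
        center = centroid(sample_feats)
        for feat in sample_feats:
            total += sq_dist(feat, center)
            count += 1
    return total / count

def inter_divisibility(feats, clusters, margin=1.0):
    """Hinge penalty when centroids of different pseudo clusters are closer than `margin`."""
    groups = {}
    for feat, cluster in zip(feats, clusters):
        groups.setdefault(cluster, []).append(feat)
    cents = [centroid(group) for _, group in sorted(groups.items())]
    penalties = [max(0.0, margin - math.sqrt(sq_dist(a, b)))
                 for i, a in enumerate(cents) for b in cents[i + 1:]]
    return sum(penalties) / len(penalties) if penalties else 0.0

# Toy data: two hypothetical models, three videos, 2-D projected features.
model_a = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
model_b = [[0.0, 2.0], [1.0, 0.0], [3.0, 0.0]]
clusters = [pseudo_cluster(m) for m in [1.5, 2.0, 4.5]]

intra = intra_consistency([model_a, model_b])
inter_loose = inter_divisibility(model_a, clusters, margin=1.0)
inter_tight = inter_divisibility(model_a, clusters, margin=3.0)
print(clusters, intra, inter_loose, inter_tight)
```

In this toy setup only the first video is placed differently by the two models, so it alone contributes to the consistency term. The divisibility term is zero until the margin exceeds the distance between the two cluster centroids.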

The PTM-VQA framework consists of three main components:

Feature Extraction: The input video is passed through a set of pre-trained models with frozen weights, each trained on a different pre-task, to extract a rich set of features.
Feature Fusion: The features from the various pre-trained models are then combined using a fusion mechanism to create a comprehensive representation of the video.
Quality Prediction: The fused features are used as input to a lightweight regression model to predict the final video quality score.
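The three components above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' implementation: the two "models" are hypothetical stand-in functions rather than real pre-trained networks, fusion is plain concatenation, and the regressor is a linear layer with hand-picked weights.

```python
from statistics import fmean

def extract_features(frames, extractor):
    """Apply one frozen extractor to every frame, then average over time."""
    per_frame = [extractor(frame) for frame in frames]
    return [fmean(column) for column in zip(*per_frame)]

def fuse(feature_sets):
    """Fuse by concatenation (an assumption; the fusion mechanism is not specified here)."""
    return [value for features in feature_sets for value in features]

def predict_quality(fused, weights, bias):
    """Lightweight linear regressor; in practice the weights would be fit to MOS labels."""
    return sum(w * x for w, x in zip(weights, fused)) + bias

# Hypothetical stand-ins for frozen pre-trained models.
# A "frame" is simplified to a flat list of pixel intensities.
def brightness_model(frame):
    return [fmean(frame)]

def contrast_model(frame):
    return [max(frame) - min(frame), fmean(frame) ** 2]

video = [[0.2, 0.4, 0.6], [0.4, 0.6, 0.8]]  # two tiny frames
features = [extract_features(video, model)
            for model in (brightness_model, contrast_model)]
fused = fuse(features)
score = predict_quality(fused, weights=[1.0, 2.0, 0.5], bias=0.1)
print(len(fused), round(score, 2))
```

Because the extractors are never updated, their features could be computed once and cached, leaving only the small regressor to train. That is one reading of why a frozen-backbone design can be cheap.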

The researchers evaluate PTM-VQA on several standard video quality assessment datasets and compare its performance to state-of-the-art methods. The results show that PTM-VQA can achieve competitive or even superior quality prediction accuracy while being significantly more efficient in terms of computational cost and model complexity.
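This summary does not say which metrics the comparison uses. VQA work commonly reports the Spearman rank-order correlation (SRCC) and the Pearson linear correlation (PLCC) between predicted scores and MOS, so the sketch below assumes those two. The scores are toy values, not results from the paper.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy scores: the ordering matches the MOS exactly, the scale does not.
predicted = [1.0, 2.0, 3.0, 10.0]
mos = [1.0, 2.0, 3.0, 4.0]
srcc = spearman(predicted, mos)
plcc = pearson(predicted, mos)
print(srcc, plcc)
```

SRCC only checks that the predicted ordering agrees with the human ordering, while PLCC also penalizes a non-linear scale. That is why the toy prediction scores perfectly on the first and lower on the second.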

Critical Analysis

The researchers have proposed a practical approach to video quality assessment built on a diverse set of pre-trained models. Its key strength is that it reuses a wide range of existing models without extensive training or fine-tuning, which can make the process more efficient and accessible.

However, although the paper proposes selecting candidate models by their clustering performance on VQA datasets, it would still be valuable to understand which types of models contribute most to the final quality prediction and how the selection and fusion process can be further optimized.
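The abstract states that models with better clustering performance on VQA datasets are chosen as candidates, without giving the criterion in detail. The sketch below assumes a simple separation score (mean distance between cluster centroids divided by mean distance of samples to their own centroid). The score, the function names, and the toy features are hypothetical.

```python
import math

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def clustering_score(feats, labels):
    """Between-centroid distance over within-cluster distance; higher means
    the model's features separate the quality clusters better.
    Assumes at least two distinct labels."""
    groups = {}
    for feat, label in zip(feats, labels):
        groups.setdefault(label, []).append(feat)
    cents = {label: centroid(group) for label, group in groups.items()}
    within = sum(dist(f, cents[l]) for f, l in zip(feats, labels)) / len(feats)
    keys = sorted(cents)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    between = sum(dist(cents[a], cents[b]) for a, b in pairs) / len(pairs)
    return between / (within + 1e-8)

def select_models(candidates, labels, top_k):
    """Rank candidate models by clustering score and keep the best `top_k`."""
    ranked = sorted(candidates,
                    key=lambda name: clustering_score(candidates[name], labels),
                    reverse=True)
    return ranked[:top_k]

# Toy 1-D features from two hypothetical models on four videos.
labels = [0, 0, 1, 1]  # pseudo quality clusters, e.g. low vs. high MOS
candidates = {
    "model_a": [[0.0], [0.2], [1.0], [1.2]],  # clusters well separated
    "model_b": [[0.0], [1.0], [0.1], [1.1]],  # clusters overlap
}
chosen = select_models(candidates, labels, top_k=1)
print(chosen)
```

A score like this needs only one forward pass per candidate and no training, which is one plausible reason a clustering-based criterion could count as an efficient selection scheme.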

Additionally, the paper does not address the potential limitations of using pre-trained models, such as the risk of domain shift or the potential for biases in the original training data. Further research could explore ways to mitigate these issues and ensure the robustness of the PTM-VQA approach.

Conclusion

The PTM-VQA method proposed in this paper represents an interesting and practical approach to video quality assessment. By leveraging a diverse set of pre-trained models, the researchers have developed a more efficient and accessible solution compared to traditional methods that require extensive training or specialized expertise.

The promising results demonstrated in the paper suggest that this approach could have significant practical applications, particularly in scenarios where real-time or resource-constrained video quality assessment is required. Further research and development in this direction could lead to more robust and versatile video quality assessment tools that can benefit a wide range of industries and applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models

Wei Sun, Wen Wen, Xiongkuo Min, Long Lan, Guangtao Zhai, Kede Ma

Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving the end-users' viewing experience in various real-world video-enabled media applications. As an experimental field, the improvements of BVQA models have been measured primarily on a few human-rated VQA datasets. Thus, it is crucial to gain a better understanding of existing VQA datasets in order to properly evaluate the current progress in BVQA. Towards this goal, we conduct a first-of-its-kind computational analysis of VQA datasets via designing minimalistic BVQA models. By minimalistic, we restrict our family of BVQA models to build only upon basic blocks: a video preprocessor (for aggressive spatiotemporal downsampling), a spatial quality analyzer, an optional temporal quality analyzer, and a quality regressor, all with the simplest possible instantiations. By comparing the quality prediction performance of different model variants on eight VQA datasets with realistic distortions, we find that nearly all datasets suffer from the easy dataset problem of varying severity, some of which even admit blind image quality assessment (BIQA) solutions. We additionally justify our claims by contrasting our model generalizability on these VQA datasets, and by ablating a dizzying set of BVQA design choices related to the basic building blocks. Our results cast doubt on the current progress in BVQA, and meanwhile shed light on good practices of constructing next-generation VQA datasets and models.

4/4/2024

cs.CV cs.MM eess.IV

Enhancing Blind Video Quality Assessment with Rich Quality-aware Features

Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai

In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous researches that leverage pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model to handle complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. Then, we extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware along with scene-specific features, and spatiotemporal quality-aware features, respectively. Through concatenating these features, we employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at https://github.com/sunwei925/RQ-VQA.git.

5/15/2024

eess.IV cs.CV cs.MM

CLIP-Guided Attribute Aware Pretraining for Generalizable Image Quality Assessment

Daekyu Kwon, Dongyoung Kim, Sehwan Ki, Younghyun Jo, Hyong-Euk Lee, Seon Joo Kim

In no-reference image quality assessment (NR-IQA), the challenge of limited dataset sizes hampers the development of robust and generalizable models. Conventional methods address this issue by utilizing large datasets to extract rich representations for IQA. Also, some approaches propose vision language models (VLM) based IQA, but the domain gap between generic VLM and IQA constrains their scalability. In this work, we propose a novel pretraining framework that constructs a generalizable representation for IQA by selectively extracting quality-related knowledge from VLM and leveraging the scalability of large datasets. Specifically, we carefully select optimal text prompts for five representative image quality attributes and use VLM to generate pseudo-labels. Numerous attribute-aware pseudo-labels can be generated with large image datasets, allowing our IQA model to learn rich representations about image quality. Our approach achieves state-of-the-art performance on multiple IQA datasets and exhibits remarkable generalization capabilities. Leveraging these strengths, we propose several applications, such as evaluating image generation models and training image enhancement models, demonstrating our model's real-world applicability. We will make the code available for access.

6/4/2024

cs.CV

Multiview Contrastive Learning for Completely Blind Video Quality Assessment of User Generated Content

Shankhanil Mitra, Rajiv Soundararajan

Completely blind video quality assessment (VQA) refers to a class of quality assessment methods that do not use any reference videos, human opinion scores or training videos from the target database to learn a quality model. The design of this class of methods is particularly important since it can allow for superior generalization in performance across various datasets. We consider the design of completely blind VQA for user generated content. While several deep feature extraction methods have been considered in supervised and weakly supervised settings, such approaches have not been studied in the context of completely blind VQA. We bridge this gap by presenting a self-supervised multiview contrastive learning framework to learn spatio-temporal quality representations. In particular, we capture the common information between frame differences and frames by treating them as a pair of views and similarly obtain the shared representations between frame differences and optical flow. The resulting features are then compared with a corpus of pristine natural video patches to predict the quality of the distorted video. Detailed experiments on multiple camera captured VQA datasets reveal the superior performance of our method over other features when evaluated without training on human scores.

6/25/2024

eess.IV