Multiview Contrastive Learning for Completely Blind Video Quality Assessment of User Generated Content

2207.06148

Published 6/25/2024 by Shankhanil Mitra, Rajiv Soundararajan

Abstract

Completely blind video quality assessment (VQA) refers to a class of quality assessment methods that do not use any reference videos, human opinion scores or training videos from the target database to learn a quality model. The design of this class of methods is particularly important since it can allow for superior generalization in performance across various datasets. We consider the design of completely blind VQA for user generated content. While several deep feature extraction methods have been considered in supervised and weakly supervised settings, such approaches have not been studied in the context of completely blind VQA. We bridge this gap by presenting a self-supervised multiview contrastive learning framework to learn spatio-temporal quality representations. In particular, we capture the common information between frame differences and frames by treating them as a pair of views and similarly obtain the shared representations between frame differences and optical flow. The resulting features are then compared with a corpus of pristine natural video patches to predict the quality of the distorted video. Detailed experiments on multiple camera captured VQA datasets reveal the superior performance of our method over other features when evaluated without training on human scores.

Overview

This paper presents a self-supervised multiview contrastive learning framework to learn spatio-temporal quality representations for completely blind video quality assessment (VQA).
Completely blind VQA refers to quality assessment methods that do not use any reference videos, human opinion scores, or training videos from the target database.
The authors focus on applying this approach to user-generated content, where several deep feature extraction methods have been explored in supervised and weakly supervised settings but not in the completely blind VQA context.

Plain English Explanation

The paper describes a way to assess video quality without reference videos, human ratings, or training videos from the dataset being evaluated. This matters for user-generated content, which is usually captured and uploaded with no pristine reference to compare against.

The key idea is to use a self-supervised multiview contrastive learning framework to learn features that capture spatial and temporal information in videos. Different representations of the same video are treated as "views" of the same content: frames are paired with frame differences, and frame differences are paired with optical flow. The model learns features that capture the information each pair has in common.

These learned features are then compared to a collection of high-quality natural video patches to predict the quality of the distorted video. This approach does not require any human-provided quality scores or reference videos, which makes it very flexible and able to generalize well to different datasets.

Technical Explanation

The authors present a self-supervised multiview contrastive learning framework to learn spatio-temporal quality representations for completely blind VQA. They capture the common information between frame differences and frames, as well as between frame differences and optical flow, and use these shared representations to predict the quality of distorted videos.
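As a rough illustration of learning shared information between two views, the sketch below computes frame differences and an InfoNCE-style contrastive loss with NumPy. This summary does not state the authors' exact loss, encoders, or temperature, so InfoNCE, the function names, and the random toy embeddings are all assumptions standing in for the real pipeline.

```python
import numpy as np

def frame_differences(frames):
    """Differences between consecutive frames; frames has shape (T, H, W)."""
    return frames[1:] - frames[:-1]

def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE loss between two batches of embeddings of shape (N, D).

    Row i of view_a and row i of view_b come from the same clip (a positive
    pair); every other row of view_b serves as a negative for row i.
    """
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # (N, N) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # subtract row max for stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))      # average over positive pairs

rng = np.random.default_rng(0)

# Hypothetical toy video: 5 grayscale frames of 4x4 pixels.
video = rng.random((5, 4, 4))
diffs = frame_differences(video)

# Hypothetical embeddings standing in for encoder outputs of the two views.
z_frames = rng.normal(size=(8, 16))
z_diffs = z_frames + 0.05 * rng.normal(size=(8, 16))   # views that agree
loss_aligned = info_nce(z_frames, z_diffs)
loss_shuffled = info_nce(z_frames, z_diffs[::-1])      # mismatched pairs
print(loss_aligned, loss_shuffled)
```

The loss is low when matching rows of the two views are similar and high when the pairing is broken, which is the property a contrastive objective uses to pull the views of one clip together.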

Specifically, two pairs of views are used: frames with frame differences, and frame differences with optical flow. Frame differences appear in both pairs, so they serve as the common view. The learned features are then compared with a corpus of pristine natural video patches to predict the quality of the distorted video.
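One way to compare features with a pristine corpus, without any human scores, is a NIQE-style distance between Gaussian fits of the two feature sets. The summary does not say which distance the authors use, so the sketch below is an assumption modeled on NIQE, and the feature arrays are synthetic placeholders.

```python
import numpy as np

def fit_gaussian(features):
    """Mean vector and covariance matrix of an (N, D) feature array."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    return mu, cov

def niqe_style_distance(pristine_feats, test_feats):
    """Distance between Gaussian fits of pristine and test features.

    Larger values mean the test features deviate more from the pristine
    statistics, which is read as lower predicted quality.
    """
    mu_p, cov_p = fit_gaussian(pristine_feats)
    mu_t, cov_t = fit_gaussian(test_feats)
    diff = mu_p - mu_t
    pooled = (cov_p + cov_t) / 2.0
    # pinv guards against a singular pooled covariance matrix.
    value = diff @ np.linalg.pinv(pooled) @ diff
    return float(np.sqrt(max(value, 0.0)))

rng = np.random.default_rng(0)
pristine = rng.normal(size=(500, 4))             # placeholder pristine-patch features
mild = rng.normal(loc=0.1, size=(200, 4))        # placeholder lightly distorted video
severe = rng.normal(loc=3.0, size=(200, 4))      # placeholder heavily distorted video

d_self = niqe_style_distance(pristine, pristine)
d_mild = niqe_style_distance(pristine, mild)
d_severe = niqe_style_distance(pristine, severe)
print(d_self, d_mild, d_severe)
```

Because the score depends only on statistics of the pristine corpus and the test video, no labels from the target dataset enter the prediction.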

The authors evaluate their method on multiple camera-captured VQA datasets and report that it outperforms other features when evaluated without training on human scores. Because no videos or scores from the target datasets are used, these results suggest that the approach generalizes across datasets.
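Evaluations of this kind usually compare predicted scores with human opinion scores through a rank correlation. This summary does not name the metric used in the paper, so the sketch below assumes the Spearman rank correlation coefficient (SRCC), a common choice in VQA; the scores are made-up numbers for illustration only.

```python
import numpy as np

def rankdata(values):
    """Ranks from 1 to N, with tied values sharing their average rank."""
    x = np.asarray(values, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def srcc(predicted, human):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rp, rh = rankdata(predicted), rankdata(human)
    return float(np.corrcoef(rp, rh)[0, 1])

# Made-up example: a distance-to-pristine score grows as quality drops,
# so it should be negatively correlated with mean opinion scores.
predicted_distance = [0.9, 0.2, 0.5, 0.7]
mean_opinion_score = [1.5, 4.6, 3.2, 2.4]
rho = srcc(predicted_distance, mean_opinion_score)
print(rho)
```

Human scores are used here only to measure agreement after the fact; they play no part in producing the predictions.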

Critical Analysis

The paper presents a novel and promising approach to completely blind VQA, which is an important problem as it can enable quality assessment in scenarios where reference videos or human ratings are not available. The authors' use of self-supervised multiview contrastive learning is a well-justified and technically sound approach.

One potential limitation is that the method still requires a corpus of high-quality natural video patches, which may not always be readily available. Additionally, the paper does not explore the sensitivity of the method to the size or composition of this reference corpus.

Further research could investigate the effect of sharpness on the performance of completely blind VQA methods, as well as explore the use of recurrent memory transformers or learned scanpaths to enhance the quality representations.

Additionally, a more in-depth analysis of video quality datasets used in the evaluation could provide further insights into the strengths and limitations of the proposed approach.

Conclusion

This paper presents a novel self-supervised multiview contrastive learning framework for completely blind video quality assessment. It learns spatio-temporal quality representations without relying on reference videos, human opinion scores, or training videos from the target database, and the authors report that it outperforms other features on multiple camera-captured VQA datasets when evaluated without training on human scores.

The ability to perform quality assessment without any reference material is a significant advancement, as it enables the application of VQA techniques to a wider range of user-generated content scenarios. The authors' work contributes to the growing field of completely blind VQA and paves the way for further research into enhancing the robustness and versatility of video quality assessment methods.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Enhancing Blind Video Quality Assessment with Rich Quality-aware Features

Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai

In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous research that leverages pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model to handle complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. Then, we extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware along with scene-specific features, and spatiotemporal quality-aware features, respectively. Through concatenating these features, we employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at https://github.com/sunwei925/RQ-VQA.git.

5/15/2024

eess.IV cs.CV cs.MM

RMT-BVQA: Recurrent Memory Transformer-based Blind Video Quality Assessment for Enhanced Video Content

Tianhao Peng, Chen Feng, Duolikun Danier, Fan Zhang, David Bull

With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artefacts and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimised through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.

5/16/2024

eess.IV cs.CV

Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models

Wei Sun, Wen Wen, Xiongkuo Min, Long Lan, Guangtao Zhai, Kede Ma

Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving the end-users' viewing experience in various real-world video-enabled media applications. As an experimental field, the improvements of BVQA models have been measured primarily on a few human-rated VQA datasets. Thus, it is crucial to gain a better understanding of existing VQA datasets in order to properly evaluate the current progress in BVQA. Towards this goal, we conduct a first-of-its-kind computational analysis of VQA datasets via designing minimalistic BVQA models. By minimalistic, we restrict our family of BVQA models to build only upon basic blocks: a video preprocessor (for aggressive spatiotemporal downsampling), a spatial quality analyzer, an optional temporal quality analyzer, and a quality regressor, all with the simplest possible instantiations. By comparing the quality prediction performance of different model variants on eight VQA datasets with realistic distortions, we find that nearly all datasets suffer from the easy dataset problem of varying severity, some of which even admit blind image quality assessment (BIQA) solutions. We additionally justify our claims by contrasting our model generalizability on these VQA datasets, and by ablating a dizzying set of BVQA design choices related to the basic building blocks. Our results cast doubt on the current progress in BVQA, and meanwhile shed light on good practices of constructing next-generation VQA datasets and models.

4/4/2024

cs.CV cs.MM eess.IV

Study of the effect of Sharpness on Blind Video Quality Assessment

Anantha Prabhu, David Pratap, Narayana Darapeni, Anwesh P R

Introduction: Video Quality Assessment (VQA) is one of the important areas of study in this modern era, where video is a crucial component of communication with applications in every field. Rapid developments in mobile technology have enabled anyone to create videos, resulting in a varied range of video quality scenarios. Objectives: Though VQA has been present for some time with classical metrics like SSIM and PSNR, the advent of machine learning has brought in new VQA techniques built upon Convolutional Neural Networks (CNNs) or Deep Neural Networks (DNNs). Methods: Over the past years, various research studies such as the BVQA, which performed video quality assessment of nature-based videos using DNNs, exposed the powerful capabilities of machine learning algorithms. BVQA using DNNs explored human visual system effects such as content dependency and time-related factors normally known as temporal effects. Results: This study explores the sharpness effect on models like BVQA. Sharpness is the measure of the clarity and details of the video image. Sharpness typically involves analyzing the edges and contrast of the image to determine the overall level of detail and sharpness. Conclusion: This study uses existing video quality databases such as CVD2014. A comparative study of various machine learning parameters such as SRCC and PLCC during training and testing is presented along with the conclusion.

4/10/2024

eess.IV cs.CV