RMT-BVQA: Recurrent Memory Transformer-based Blind Video Quality Assessment for Enhanced Video Content

Read original: arXiv:2405.08621 - Published 9/9/2024 by Tianhao Peng, Chen Feng, Duolikun Danier, Fan Zhang, Benoit Vallade, Alex Mackin, David Bull

Overview

The paper proposes a novel blind deep video quality assessment (VQA) method for evaluating the quality of enhanced video content.
It uses a Recurrent Memory Transformer (RMT) based network to obtain video quality representations; the network is trained with a content-quality-aware contrastive learning strategy on a new database of 13K training patches with enhanced content.
The method, called RMT-BVQA, is evaluated on the VDPVE database through five-fold cross-validation and shows superior correlation performance compared to 10 existing no-reference quality metrics.

Plain English Explanation

When videos are edited or enhanced, it's important to assess their quality. However, enhancement methods are often evaluated with quality metrics that were designed for compressed videos, not enhanced ones. This paper proposes a new way to evaluate the quality of enhanced videos.

The key idea is to use a deep learning model that can understand the unique characteristics of enhanced videos. The model, called RMT-BVQA, uses a special type of neural network architecture called a Recurrent Memory Transformer. This allows the model to learn representations of video quality that are tailored to enhanced content.

The model is trained on a new database of 13,000 video patches with enhanced content, using a "content-quality-aware" contrastive learning strategy. This training approach helps the model learn quality representations that suit enhanced videos.

When evaluating a new video, RMT-BVQA extracts quality representations from the video and combines them through linear regression to produce an overall quality score. Testing on a benchmark dataset shows that its scores correlate better with human judgments than those of 10 existing no-reference quality metrics.

Technical Explanation

The paper presents a novel blind video quality assessment (VQA) method for evaluating the quality of enhanced video content. Existing VQA metrics were primarily developed for compressed videos and may not accurately capture the perceptual quality of enhanced content.

The proposed method, called RMT-BVQA, employs a Recurrent Memory Transformer (RMT) network architecture to obtain video quality representations. The RMT module uses a memory mechanism to selectively aggregate information across frames, allowing it to capture temporal dependencies relevant to video quality.
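
As a rough illustration of how a recurrent memory can carry information across frame segments, the sketch below uses plain Python and single-head dot-product attention. It is a toy under stated assumptions, not the paper's network: the function names (`attend`, `recurrent_memory_pass`), the segment length, and the feature values are all hypothetical, and learned projections, feed-forward layers, and normalization are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, tokens):
    """Single-head dot-product attention of one query over a list of tokens.

    Hypothetical simplification: queries, keys, and values are the raw
    vectors (no learned projection matrices).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(len(query))]

def recurrent_memory_pass(frame_features, segment_len, memory):
    """Process frames segment by segment, carrying memory tokens forward."""
    for start in range(0, len(frame_features), segment_len):
        segment = frame_features[start:start + segment_len]
        tokens = memory + segment
        # Each memory token re-reads the old memory plus the current segment,
        # so information from earlier segments can persist in the memory.
        memory = [attend(m, tokens) for m in memory]
    return memory

# Illustrative per-frame feature vectors (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
final_memory = recurrent_memory_pass(frames, segment_len=2, memory=[[0.1, 0.1]])
print(final_memory)
```

Because the memory tokens are the only state passed between segments, each step attends over one segment plus the memory rather than the whole video, which is the usual motivation for this kind of recurrence.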

The RMT-BVQA model is trained using a content-quality-aware contrastive learning strategy based on a new database of 13,000 video patches with enhanced content. This training approach helps the model learn quality representations that are tailored to the characteristics of enhanced videos.
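
To make the idea of contrastive training concrete, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. This summary does not describe how the paper forms its positive and negative pairs or which loss it uses, so the pairing, the temperature value, and the function names (`cosine`, `info_nce`) are assumptions chosen for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: -log softmax of the positive among all candidates.

    Assumption: the positive is a representation that should sit close to the
    anchor (for example, similar quality), and the negatives should sit far away.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

anchor = [1.0, 0.0]
# Positive close to the anchor, negatives far away: the loss should be small.
good = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive far from the anchor, one negative close: the loss should be large.
bad = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(good, bad)
```

Minimizing a loss of this shape pulls representations of matched pairs together and pushes mismatched ones apart; the paper's content-quality-aware strategy determines which pairs count as matched.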

The extracted quality representations are then combined through linear regression to generate video-level quality indices. RMT-BVQA is evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database using five-fold cross-validation. The results demonstrate the method's superior correlation performance compared to 10 existing no-reference quality metrics.
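
The evaluation protocol described above can be sketched as follows: fit a linear regressor on four folds, predict the held-out fold, and measure rank correlation between predictions and subjective scores. This is a sketch under stated assumptions, not the authors' code: the data is synthetic, the interleaved fold split is hypothetical, and Spearman rank correlation (SROCC) is used because it is a common choice in VQA, although this summary does not state which correlation coefficients the paper reports.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(features, scores):
    """Ordinary least squares with a bias term, via the normal equations."""
    rows = [f + [1.0] for f in features]
    dim = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    xty = [sum(r[i] * s for r, s in zip(rows, scores)) for i in range(dim)]
    return solve(xtx, xty)

def predict(weights, feature):
    """Apply the fitted weights (last weight is the bias) to one feature vector."""
    return sum(w * x for w, x in zip(weights, feature + [1.0]))

def ranks(values):
    """Rank values from 1 to n (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation for two sequences without ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def cross_val_predictions(features, scores, k=5):
    """Predict every item with a model fitted on the other k-1 folds."""
    n = len(features)
    preds = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]  # hypothetical interleaved split
        w = fit_linear([features[i] for i in train], [scores[i] for i in train])
        for i in range(n):
            if i % k == fold:
                preds[i] = predict(w, features[i])
    return preds

# Synthetic stand-ins for quality representations and subjective scores.
features = [[float(i), float((i * 7) % 10)] for i in range(20)]
scores = [2.0 * f[0] + 0.5 * f[1] + 1.0 for f in features]
preds = cross_val_predictions(features, scores)
srocc = spearman(preds, scores)
print(srocc)
```

With real data the relationship is not exactly linear, so the held-out correlation falls below 1; comparing that value across metrics is what a correlation comparison of this kind amounts to.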

Critical Analysis

The paper addresses an important issue in video quality assessment, as the quality of enhanced videos is not well captured by existing VQA metrics. The proposed RMT-BVQA method shows promising results, but there are a few potential limitations and areas for further research:

The VDPVE database used for evaluation may not be representative of all types of enhanced video content. Evaluating the method on a more diverse set of enhanced videos would provide a more comprehensive assessment.
The linear regression approach for combining quality representations may oversimplify the relationship between the representations and the final quality score. Exploring more sophisticated fusion techniques could potentially improve performance.
The paper does not provide much insight into the interpretability of the quality representations learned by the RMT module. Understanding how the model arrives at its quality assessments could be valuable for practical applications.

Overall, the RMT-BVQA method is a promising step towards better quality assessment for enhanced video content, but further research and validation would be useful to fully understand its capabilities and limitations.

Conclusion

The paper proposes a novel blind deep video quality assessment (VQA) method, RMT-BVQA, that is specifically designed for evaluating the quality of enhanced video content. By employing a Recurrent Memory Transformer network architecture and a content-quality-aware training approach, the method is able to learn video quality representations that are well-suited for enhanced videos.

Evaluation on the VDPVE database demonstrates the superior performance of RMT-BVQA compared to 10 existing no-reference quality metrics. This is an important advancement, as the quality of enhanced videos is not well captured by existing VQA methods.

The proposed approach has the potential to improve the assessment and optimization of video enhancement algorithms, leading to better-quality video experiences for users. Addressing the identified limitations and expanding the method's capabilities could strengthen its impact on the field of video quality assessment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

RMT-BVQA: Recurrent Memory Transformer-based Blind Video Quality Assessment for Enhanced Video Content

Tianhao Peng, Chen Feng, Duolikun Danier, Fan Zhang, Benoit Vallade, Alex Mackin, David Bull

With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artifacts, and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimized through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.

9/9/2024

Enhancing Blind Video Quality Assessment with Rich Quality-aware Features

Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai

In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous research that leverages pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model to handle complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. Then, we extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware along with scene-specific features, and spatiotemporal quality-aware features, respectively. Through concatenating these features, we employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at https://github.com/sunwei925/RQ-VQA.git.

5/15/2024

Multiview Contrastive Learning for Completely Blind Video Quality Assessment of User Generated Content

Shankhanil Mitra, Rajiv Soundararajan

Completely blind video quality assessment (VQA) refers to a class of quality assessment methods that do not use any reference videos, human opinion scores or training videos from the target database to learn a quality model. The design of this class of methods is particularly important since it can allow for superior generalization in performance across various datasets. We consider the design of completely blind VQA for user generated content. While several deep feature extraction methods have been considered in supervised and weakly supervised settings, such approaches have not been studied in the context of completely blind VQA. We bridge this gap by presenting a self-supervised multiview contrastive learning framework to learn spatio-temporal quality representations. In particular, we capture the common information between frame differences and frames by treating them as a pair of views and similarly obtain the shared representations between frame differences and optical flow. The resulting features are then compared with a corpus of pristine natural video patches to predict the quality of the distorted video. Detailed experiments on multiple camera captured VQA datasets reveal the superior performance of our method over other features when evaluated without training on human scores.

6/25/2024

CLIPVQA: Video Quality Assessment via CLIP

Fengchuang Xing, Mingjie Li, Yuan-Gen Wang, Guopu Zhu, Xiaochun Cao

In learning vision-language representations from web-scale data, the contrastive language-image pre-training (CLIP) mechanism has demonstrated a remarkable performance in many vision tasks. However, its application to the widely studied video quality assessment (VQA) task is still an open issue. In this paper, we propose an efficient and effective CLIP-based Transformer method for the VQA problem (CLIPVQA). Specifically, we first design an effective video frame perception paradigm with the goal of extracting the rich spatiotemporal quality and content information among video frames. Then, the spatiotemporal quality features are adequately integrated together using a self-attention mechanism to yield video-level quality representation. To utilize the quality language descriptions of videos for supervision, we develop a CLIP-based encoder for language embedding, which is then fully aggregated with the generated content information via a cross-attention module for producing video-language representation. Finally, the video-level quality and video-language representations are fused together for final video quality prediction, where a vectorized regression loss is employed for efficient end-to-end optimization. Comprehensive experiments are conducted on eight in-the-wild video datasets with diverse resolutions to evaluate the performance of CLIPVQA. The experimental results show that the proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods. A series of ablation studies are also performed to validate the effectiveness of each module in CLIPVQA.

7/9/2024