Light-VQA+: A Video Quality Assessment Model for Exposure Correction with Vision-Language Guidance

Read original: arXiv:2405.03333 - Published 5/15/2024 by Xunchu Zhou, Xiaohong Liu, Yunlong Dong, Tengchuan Kou, Yixuan Gao, Zicheng Zhang, Chunyi Li, Haoning Wu, Guangtao Zhai


Overview

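This paper proposes Light-VQA+, a video quality assessment (VQA) model specialized in judging the results of video exposure correction (VEC).
The key contributions include:
- The VEC-QA dataset, which extends the earlier LLVE-QA dataset with over-exposed videos and their corrected versions.
- Light-VQA+, which adds CLIP-based vision-language guidance to feature extraction, followed by a module modeled on the Human Visual System (HVS).
- Experiments in which the authors report the best performance among current state-of-the-art VQA models on VEC-QA and other public datasets.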

Plain English Explanation

The paper describes a new way to measure how good a video looks after its exposure has been corrected. Videos, especially user-generated ones, often appear too dark or too bright, which makes them difficult to watch and enjoy. Exposure correction algorithms try to fix this, but the quality of their output varies.

The researchers developed a machine learning model that automatically rates exposure-corrected videos. It combines visual features with text-based guidance from CLIP, a vision-language model, and then passes them through a module modeled on how the human visual system perceives quality. Most existing quality models are general-purpose and are not designed around exposure problems.

To train and test their model, the researchers also built a dataset, VEC-QA, by adding over-exposed videos and their corrected versions to an earlier low-light dataset (LLVE-QA). This lets them evaluate the model on videos that started out too dark as well as videos that started out too bright.

Overall, this research extends exposure-specific video quality assessment from low-light enhancement to exposure correction in general. A reliable quality score makes it easier to compare exposure correction algorithms, which matters most for user-generated content, where poor exposure is common.

Technical Explanation

The paper introduces Light-VQA+, a VQA model specialized in assessing video exposure correction (VEC), which covers both Low-Light Video Enhancement (LLVE) and Over-Exposed Video Recovery (OEVR). It builds on Light-VQA, an earlier model trained on the LLVE-QA dataset for assessing LLVE. According to the abstract, the key components include:

Feature extraction that uses the CLIP model with vision-language guidance.
A subsequent module modeled on the Human Visual System (HVS), intended to make the assessment more accurate.
The VEC-QA dataset, used for training and evaluation.

The researchers create VEC-QA by expanding LLVE-QA with over-exposed videos and their corresponding corrected versions, so that the dataset covers both under- and over-exposure.

The authors report that Light-VQA+ achieves the best performance against current state-of-the-art VQA models on VEC-QA and on other public datasets.
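The paper's abstract describes CLIP-based vision-language guidance during feature extraction, but this summary does not give the exact mechanism. As a rough illustration only, the sketch below shows one common form of prompt-based scoring: compare a frame embedding with the embeddings of a few text prompts and turn the cosine similarities into probabilities. The prompt wording, the 4-dimensional toy embeddings, the temperature, and the function names are all assumptions made for this example; a real system would obtain the embeddings from CLIP's image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prompt_probabilities(frame_emb, prompt_embs, temperature=0.1):
    """Softmax over temperature-scaled cosine similarities.

    The temperature is an assumed value chosen for illustration.
    """
    logits = [cosine(frame_emb, p) / temperature for p in prompt_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical prompts describing exposure; not taken from the paper.
PROMPTS = [
    "an underexposed video frame",
    "a well-exposed video frame",
    "an overexposed video frame",
]

# Toy stand-ins for encoder outputs (assumed values, not real CLIP features).
prompt_embs = [
    [1.0, 0.0, 0.0, 0.2],
    [0.0, 1.0, 0.0, 0.2],
    [0.0, 0.0, 1.0, 0.2],
]
frame_emb = [0.9, 0.3, 0.1, 0.2]

probs = prompt_probabilities(frame_emb, prompt_embs)
best = PROMPTS[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

The resulting probability vector can serve as a compact, exposure-aware feature alongside other visual features. Whether Light-VQA+ uses its prompts in this way is not stated here.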

Critical Analysis

The paper addresses a real gap: most existing VQA models are general-purpose, while exposure correction calls for a specialized assessment. The authors contribute both a specialized model, Light-VQA+, and an expanded dataset, VEC-QA, for training and evaluation.

One potential limitation is that a VQA model is judged by its agreement with human ratings on benchmark datasets, which may not fully capture the nuanced aspects of exposure correction. The authors report results on other public datasets as well, but it would be interesting to see how the model performs in real-world applications.

Additionally, the abstract does not discuss the computational and memory requirements of Light-VQA+, which could be an important consideration for practical deployment, especially on resource-constrained devices. Further research into model optimization and efficient implementation could help address this.

Overall, the paper extends exposure-specific quality assessment from low-light enhancement to over-exposure recovery and provides a foundation for future work in this area. The VEC-QA dataset is also a potentially valuable resource for the broader research community.
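Agreement between a VQA model and human raters is conventionally summarized with two correlation coefficients: the Spearman rank-order correlation (SRCC) and the Pearson linear correlation (PLCC), both computed between predicted scores and mean opinion scores (MOS). This summary does not state which metrics the authors use, so treating SRCC and PLCC as the measures here is an assumption. The sketch below computes both in plain Python. The sample scores are made up for illustration, and the nonlinear score mapping that is often applied before PLCC is omitted.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up example: human mean opinion scores vs. model predictions.
mos = [1.8, 2.5, 3.1, 3.9, 4.6]
pred = [2.0, 2.4, 3.5, 3.4, 4.8]

print("SRCC:", round(spearman(mos, pred), 3))
print("PLCC:", round(pearson(mos, pred), 3))
```

SRCC only checks whether the model orders videos the same way people do, while PLCC also rewards predictions that are linearly related to the human scores, which is why the two are usually reported together.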

Conclusion

This paper introduces Light-VQA+, a VQA model that assesses the quality of exposure-corrected videos, covering both low-light enhancement and over-exposure recovery. By expanding LLVE-QA into the VEC-QA dataset and adding CLIP-based vision-language guidance and an HVS-inspired module, the researchers report better performance than existing state-of-the-art VQA models.

The proposed method could benefit applications that depend on exposure correction, particularly user-generated content, where poor exposure is common. A reliable quality score helps developers compare VEC algorithms and select outputs that viewers are likely to prefer.

The work also highlights the importance of dataset creation for advancing research in video quality assessment. The VEC-QA dataset introduced in this paper could be a useful resource for researchers working on exposure correction and related topics.



Related Papers

Light-VQA+: A Video Quality Assessment Model for Exposure Correction with Vision-Language Guidance

Xunchu Zhou, Xiaohong Liu, Yunlong Dong, Tengchuan Kou, Yixuan Gao, Zicheng Zhang, Chunyi Li, Haoning Wu, Guangtao Zhai

Recently, User-Generated Content (UGC) videos have gained popularity in our daily lives. However, UGC videos often suffer from poor exposure due to the limitations of photographic equipment and techniques. Therefore, Video Exposure Correction (VEC) algorithms have been proposed, Low-Light Video Enhancement (LLVE) and Over-Exposed Video Recovery (OEVR) included. Equally important to the VEC is the Video Quality Assessment (VQA). Unfortunately, almost all existing VQA models are built generally, measuring the quality of a video from a comprehensive perspective. As a result, Light-VQA, trained on LLVE-QA, is proposed for assessing LLVE. We extend the work of Light-VQA by expanding the LLVE-QA dataset into Video Exposure Correction Quality Assessment (VEC-QA) dataset with over-exposed videos and their corresponding corrected versions. In addition, we propose Light-VQA+, a VQA model specialized in assessing VEC. Light-VQA+ differs from Light-VQA mainly from the usage of the CLIP model and the vision-language guidance during the feature extraction, followed by a new module referring to the Human Visual System (HVS) for more accurate assessment. Extensive experimental results show that our model achieves the best performance against the current State-Of-The-Art (SOTA) VQA models on the VEC-QA dataset and other public datasets.

5/15/2024

CLIPVQA: Video Quality Assessment via CLIP

Fengchuang Xing, Mingjie Li, Yuan-Gen Wang, Guopu Zhu, Xiaochun Cao

In learning vision-language representations from web-scale data, the contrastive language-image pre-training (CLIP) mechanism has demonstrated a remarkable performance in many vision tasks. However, its application to the widely studied video quality assessment (VQA) task is still an open issue. In this paper, we propose an efficient and effective CLIP-based Transformer method for the VQA problem (CLIPVQA). Specifically, we first design an effective video frame perception paradigm with the goal of extracting the rich spatiotemporal quality and content information among video frames. Then, the spatiotemporal quality features are adequately integrated together using a self-attention mechanism to yield video-level quality representation. To utilize the quality language descriptions of videos for supervision, we develop a CLIP-based encoder for language embedding, which is then fully aggregated with the generated content information via a cross-attention module for producing video-language representation. Finally, the video-level quality and video-language representations are fused together for final video quality prediction, where a vectorized regression loss is employed for efficient end-to-end optimization. Comprehensive experiments are conducted on eight in-the-wild video datasets with diverse resolutions to evaluate the performance of CLIPVQA. The experimental results show that the proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods. A series of ablation studies are also performed to validate the effectiveness of each module in CLIPVQA.

7/9/2024

LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

Qihang Ge, Wei Sun, Yu Zhang, Yunhao Li, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai

The explosive growth of videos on streaming media platforms has underscored the urgent need for effective video quality assessment (VQA) algorithms to monitor and perceptually optimize the quality of streaming videos. However, VQA remains an extremely challenging task due to the diverse video content and the complex spatial and temporal distortions, thus necessitating more advanced methods to address these issues. Nowadays, large multimodal models (LMMs), such as GPT-4V, have exhibited strong capabilities for various visual understanding tasks, motivating us to leverage the powerful multimodal representation ability of LMMs to solve the VQA task. Therefore, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel spatiotemporal visual modeling strategy for quality-aware feature extraction. Specifically, we first reformulate the quality regression problem into a question and answering (Q&A) task and construct Q&A prompts for VQA instruction tuning. Then, we design a spatiotemporal vision encoder to extract spatial and temporal features to represent the quality characteristics of videos, which are subsequently mapped into the language space by the spatiotemporal projector for modality alignment. Finally, the aligned visual tokens and the quality-inquired text tokens are aggregated as inputs for the large language model (LLM) to generate the quality score and level. Extensive experiments demonstrate that LMM-VQA achieves state-of-the-art performance across five VQA benchmarks, exhibiting an average improvement of 5% in generalization ability over existing methods. Furthermore, due to the advanced design of the spatiotemporal encoder and projector, LMM-VQA also performs exceptionally well on general video understanding tasks, further validating its effectiveness. Our code will be released at https://github.com/Sueqk/LMM-VQA.

8/27/2024

Multiview Contrastive Learning for Completely Blind Video Quality Assessment of User Generated Content

Shankhanil Mitra, Rajiv Soundararajan

Completely blind video quality assessment (VQA) refers to a class of quality assessment methods that do not use any reference videos, human opinion scores or training videos from the target database to learn a quality model. The design of this class of methods is particularly important since it can allow for superior generalization in performance across various datasets. We consider the design of completely blind VQA for user generated content. While several deep feature extraction methods have been considered in supervised and weakly supervised settings, such approaches have not been studied in the context of completely blind VQA. We bridge this gap by presenting a self-supervised multiview contrastive learning framework to learn spatio-temporal quality representations. In particular, we capture the common information between frame differences and frames by treating them as a pair of views and similarly obtain the shared representations between frame differences and optical flow. The resulting features are then compared with a corpus of pristine natural video patches to predict the quality of the distorted video. Detailed experiments on multiple camera captured VQA datasets reveal the superior performance of our method over other features when evaluated without training on human scores.

6/25/2024