Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment

Read original: arXiv:2406.09858 - Published 6/24/2024 by Fei Zhou, Zhicong Huang, Tianhao Gu, Guoping Qiu

Overview

This paper presents SLIQUE, a blind (no-reference) image quality assessment model that uses vision-language modeling to account for an image's semantic content, distortion characteristics, and appearance properties.
The authors argue that existing methods model only some of these factors, and that covering all three gives a more complete and accurate estimate of image quality.
The model combines vision-language supervision with visual contrastive (self-supervised) learning, and is trained on a purpose-built text-annotated image database called TADAC.

Plain English Explanation

The researchers have developed a new way to evaluate the quality of images. Traditionally, image quality assessment has focused on factors like distortion or visual artifacts. However, the quality of an image is influenced by a variety of elements, including the actual content, how the image is distorted or altered, and the overall appearance.

This paper introduces a model that uses vision-language modeling to assess image quality more holistically. The key idea is to learn from images paired with text, so that the model's visual features line up with written descriptions of what an image shows, how it has been degraded, and how it looks. Those features can then be used to give a more comprehensive evaluation of quality.

For example, the model might look at the objects and scenes depicted in the image (the content), any blurriness or artifacts present (the distortion), and properties such as brightness, contrast, sharpness, and colourfulness (the appearance). By considering all of these factors, the researchers believe their approach can provide a more nuanced and accurate assessment of image quality.

This work builds on related research in the field of image quality assessment, as well as recent advancements in vision-language modeling and multi-modal machine learning.

Technical Explanation

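The researchers propose SLIQUE (Self-supervision and Vision-Language supervision Image QUality Evaluator), a blind IQA model built on a joint vision-language and visual contrastive representation learning framework. Vision-language supervision, of the kind popularized by models such as CLIP, aligns image features with text describing content, distortion, and appearance, while visual contrastive learning adds a self-supervised signal that needs no text.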
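To make the idea of aligning images with text more concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in NumPy. It is an illustration only: the function names, the temperature value, and the random stand-in embeddings are assumptions, and the paper's actual training objective may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: row i of image_emb is paired with row i of text_emb.

    The temperature is an assumed, hypothetical default.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = img @ txt.T / temperature            # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float(0.5 * (loss_i2t + loss_t2i))

# Stand-in embeddings (random, not produced by any real encoder).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 16))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 16))
shuffled_images = aligned_images[[1, 2, 3, 0]]    # every pair is now mismatched

aligned_loss = clip_style_contrastive_loss(aligned_images, text_emb)
shuffled_loss = clip_style_contrastive_loss(shuffled_images, text_emb)
print(aligned_loss, shuffled_loss)
```

The loss is low when each image embedding sits closest to its own caption and high when the pairs are mismatched, which is the pressure that pulls visual features toward the quality-related text.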

The model takes an input image and generates a quality score based on three key components:

Content: The model evaluates the semantic content of the image, assessing the objects, scenes, and activities depicted.
Distortion: The model analyzes the image for visual artifacts, such as blurriness, noise, or compression artifacts, that may degrade the quality.
Appearance: The model considers low-level appearance properties of the image, such as brightness, contrast, sharpness, and colourfulness.
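Appearance properties of this kind can be measured directly from pixels and turned into short text labels. The sketch below is one plausible way to do that, not the paper's procedure: the gradient-based sharpness measure, the opponent-colour colourfulness statistic (in the style of Hasler and Süsstrunk), the thresholds, and the caption template are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical cut-offs for turning numbers into words; not taken from the paper.
THRESHOLDS = {
    "brightness": (0.35, 0.65),
    "contrast": (0.10, 0.25),
    "sharpness": (0.02, 0.08),
    "colourfulness": (0.10, 0.30),
}

def appearance_stats(rgb):
    """Compute simple appearance statistics for a float RGB image in [0, 1], shape (H, W, 3)."""
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    luma = 0.299 * r + 0.587 * g + 0.114 * b      # Rec. 601 luma weights
    gy, gx = np.gradient(luma)                    # finite-difference gradients
    rg = r - g                                    # red-green opponent channel
    yb = 0.5 * (r + g) - b                        # yellow-blue opponent channel
    return {
        "brightness": float(luma.mean()),
        "contrast": float(luma.std()),
        "sharpness": float(np.sqrt(gx ** 2 + gy ** 2).mean()),
        "colourfulness": float(
            np.sqrt(rg.std() ** 2 + yb.std() ** 2)
            + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
        ),
    }

def level(value, low, high):
    """Map a number to a coarse word using two cut-offs."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "medium"

def appearance_caption(stats):
    """Build a short text description from the statistics."""
    parts = [f"{level(stats[name], *THRESHOLDS[name])} {name}" for name in THRESHOLDS]
    return "a photo with " + ", ".join(parts)

gray = np.full((8, 8, 3), 0.5)                    # flat mid-gray test image
red = np.zeros((8, 8, 3))
red[..., 0] = 1.0                                 # flat saturated red test image

gray_stats = appearance_stats(gray)
red_stats = appearance_stats(red)
print(appearance_caption(gray_stats))
print(appearance_caption(red_stats))
```

Captions like these could then serve as the text side of an image-text pair, alongside separate descriptions of content and distortion.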

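To train the model, the researchers built the Text Annotated Distortion, Appearance and Content (TADAC) database, which contains over 1.6 million images annotated with textual descriptions of their semantic contents, distortion characteristics, and appearance properties. They evaluate the model on several publicly available image quality datasets, including KADID-10k and PIPAL, and report that it outperforms existing image quality assessment methods across various metrics.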
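This summary does not name the metrics, but IQA models are commonly scored by how well their predictions correlate with human opinion scores, using the Spearman rank correlation (SRCC) and the Pearson linear correlation (PLCC). The sketch below shows both in plain NumPy, assuming those are the metrics of interest. The sample scores are made up, and the nonlinear mapping that some IQA papers fit before computing PLCC is omitted.

```python
import numpy as np

def plcc(pred, target):
    """Pearson linear correlation between predictions and reference scores."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pc = pred - pred.mean()
    tc = target - target.mean()
    return float((pc @ tc) / np.sqrt((pc @ pc) * (tc @ tc)))

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tie = values == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def srcc(pred, target):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return plcc(average_ranks(pred), average_ranks(target))

# Made-up example: model outputs versus mean opinion scores for five images.
predicted = [0.10, 0.40, 0.35, 0.80, 0.90]
mos = [1.0, 2.0, 3.0, 4.0, 5.0]

print("SRCC:", srcc(predicted, mos))
print("PLCC:", plcc(predicted, mos))
```

SRCC only checks that images are ranked in the right order, while PLCC also rewards a linear relationship between predicted and human scores, which is why the two are usually reported together.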

Critical Analysis

The researchers acknowledge several limitations and areas for future work:

The model's performance may be influenced by the quality and coverage of the pre-trained vision-language models used. Continued advancements in these foundational models could further improve the image quality assessment capabilities.
The current implementation focuses on evaluating individual images, but extending the approach to assess the quality of image sequences or videos could be a valuable direction for future research.
Incorporation of additional modalities, such as audio or sensor data, could provide additional cues for a more comprehensive assessment of image quality in real-world applications.

While the proposed vision-language modeling approach represents a promising step forward in image quality assessment, there are opportunities to further refine and expand the methodology to address these limitations and enhance the model's robustness and applicability.

Conclusion

This paper presents a novel vision-language modeling approach for image quality assessment that considers the content, distortion, and appearance of images. By leveraging pre-trained vision-language models, the proposed method can provide a more comprehensive and accurate evaluation of image quality compared to existing techniques.

The results demonstrate the effectiveness of this approach and suggest that the integration of visual and textual information can lead to significant improvements in image quality assessment. This work has the potential to impact a wide range of applications, from photography and digital imaging to media production and online content curation.

As the field of machine learning continues to advance, particularly in the area of multi-modal learning, the insights and methodologies presented in this paper may inspire further research and innovations in the assessment and understanding of visual media quality.

Related Papers

Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment

Fei Zhou, Zhicong Huang, Tianhao Gu, Guoping Qiu

The visual quality of an image is confounded by a number of intertwined factors including its semantic content, distortion characteristics and appearance properties such as brightness, contrast, sharpness, and colourfulness. Distilling high level knowledge about all these quality bearing attributes is crucial for developing objective Image Quality Assessment (IQA). While existing solutions have modeled some of these aspects, a comprehensive solution that involves all these important quality related attributes has not yet been developed. In this paper, we present a new blind IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image QUality Evaluator (SLIQUE) that features a joint vision-language and visual contrastive representation learning framework for acquiring high level knowledge about the images' semantic contents, distortion characteristics and appearance properties for IQA. For training SLIQUE, we have developed a systematic approach to constructing a first of its kind large image database annotated with all three categories of quality relevant texts. The Text Annotated Distortion, Appearance and Content (TADAC) database has over 1.6 million images annotated with textual descriptions of their semantic contents, distortion characteristics and appearance properties. The method for constructing TADAC and the database itself will be particularly useful for exploiting vision-language modeling for advanced IQA applications. Extensive experimental results show that SLIQUE has superior performances over state of the art, demonstrating the soundness of its design principle and the effectiveness of its implementation.

6/24/2024

Descriptive Image Quality Assessment in the Wild

Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Chao Dong, Tianfan Xue

With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce Depicted image Quality Assessment in the Wild (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named DQ-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that is helpful to filter out low-quality responses. Experimental results demonstrate that DepictQA-Wild significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications including assessing the web-downloaded images and ranking model-processed images. Datasets and codes will be released in https://depictqa.github.io/depictqa-wild/.

6/13/2024

UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment

Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Runze Hu, Xiu Li

Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Existing methods typically address these tasks independently due to distinct learning objectives. However, they neglect the underlying interconnectedness of both tasks, which hinders the learning of task-agnostic shared representations for human subjective perception. To confront this challenge, we propose Unified vision-language pre-training of Quality and Aesthetics (UniQA), to learn general perceptions of two tasks, thereby benefiting them simultaneously. Addressing the absence of text in the IQA datasets and the presence of textual noise in the IAA datasets, (1) we utilize multimodal large language models (MLLMs) to generate high-quality text descriptions; (2) the generated text for IAA serves as metadata to purify noisy IAA data. To effectively adapt the pre-trained UniQA to downstream tasks, we further propose a lightweight adapter that utilizes versatile cues to fully exploit the extensive knowledge of the pre-trained model. Extensive experiments demonstrate that our approach attains a new state-of-the-art performance on both IQA and IAA tasks, while concurrently showcasing exceptional zero-shot and few-label image assessment capabilities. The source code will be available at https://github.com/zht8506/UniQA.

6/4/2024

ExIQA: Explainable Image Quality Assessment Using Distortion Attributes

Sepehr Kazemi Ranjbar, Emad Fatemizadeh

Blind Image Quality Assessment (BIQA) aims to develop methods that estimate the quality scores of images in the absence of a reference image. In this paper, we approach BIQA from a distortion identification perspective, where our primary goal is to predict distortion types and strengths using Vision-Language Models (VLMs), such as CLIP, due to their extensive knowledge and generalizability. Based on these predicted distortions, we then estimate the quality score of the image. To achieve this, we propose an explainable approach for distortion identification based on attribute learning. Instead of prompting VLMs with the names of distortions, we prompt them with the attributes or effects of distortions and aggregate this information to infer the distortion strength. Additionally, we consider multiple distortions per image, making our method more scalable. To support this, we generate a dataset consisting of 100,000 images for efficient training. Finally, attribute probabilities are retrieved and fed into a regressor to predict the image quality score. The results show that our approach, besides its explainability and transparency, achieves state-of-the-art (SOTA) performance across multiple datasets in both PLCC and SRCC metrics. Moreover, the zero-shot results demonstrate the generalizability of the proposed approach.

9/12/2024