GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models

Read original: arXiv:2406.04654 - Published 6/10/2024 by Diptanu De, Shankhanil Mitra, Rajiv Soundararajan

Overview

This paper proposes GenzIQA, a generalized image quality assessment (IQA) approach that leverages prompt-guided latent diffusion models.
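The key idea is to use a pre-trained latent diffusion model as a feature extractor and adapt it to IQA by conditioning its denoiser on learnable, quality-aware text prompts. According to the paper's abstract, cross-attention maps from intermediate layers of the denoiser serve as the quality-aware representation of an image.
This allows the model to learn a representation of image quality that transfers across datasets with distribution shifts, a setting in which, according to the authors, state-of-the-art no-reference IQA methods generalize poorly.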

Plain English Explanation

The paper introduces a new way to evaluate the quality of images called GenzIQA. The core innovation is using a pre-trained diffusion model - a type of AI model that can generate new images - as the foundation. The researchers then fine-tune this diffusion model by providing it with natural language prompts that describe different aspects of image quality.

This allows the model to learn a more general understanding of what makes an image high or low quality, rather than being limited to a specific training dataset. The model can then be applied to assess the quality of all kinds of images, going beyond just the types it was trained on.

This flexibility is what sets the work apart: the IQA model is meant to apply more widely than previous methods, which tend to perform well only on data resembling their training set.

Technical Explanation

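The authors leverage a pre-trained latent diffusion model as the backbone of their GenzIQA approach. Diffusion models are a type of generative AI trained by gradually adding noise to training images and learning to reverse that process; new images are then generated by starting from noise and denoising step by step. Latent diffusion models run this process in a compressed latent space rather than on raw pixels.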
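To make the noising step concrete, here is a minimal sketch of the forward (noise-adding) process in the standard DDPM formulation, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The linear schedule, the function names, and the toy four-value "image" are illustrative assumptions and are not taken from the GenzIQA paper, which works with the latents of a pre-trained model.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return a linearly spaced noise schedule (assumed values, for illustration)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Compute abar_t, the running product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x_0 in one shot; also return the noise that was used."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    rng = random.Random(0)
    clean = [0.2, -0.5, 1.0, 0.0]  # stand-in for a flattened image or latent
    for t in (0, 500, 999):
        noisy, _ = add_noise(clean, t, alpha_bars, rng)
        print(t, [round(v, 3) for v in noisy])
```

A denoiser is trained to predict the noise `eps` from the noisy input `xt` and the step `t`; the later the step, the less of the original signal remains.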

By adapting this pre-trained diffusion model to various IQA datasets while conditioning it on text prompts about image quality, the model is able to learn a more generalized representation of visual quality. According to the abstract, the prompts are themselves learnable rather than fixed, and the quality-aware features are cross-attention maps taken from intermediate layers of the denoiser, which capture how well an image aligns with the quality-aware prompts.
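The role of the prompts can be illustrated with a minimal sketch of a scaled dot-product cross-attention map, in which image positions act as queries and prompt tokens act as keys. This is a generic illustration and not the authors' implementation: the function names, the two-dimensional toy embeddings, and the average pooling are all assumptions made for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_map(image_queries, prompt_keys):
    """Return one attention distribution over prompt tokens per image position."""
    scale = 1.0 / math.sqrt(len(prompt_keys[0]))
    attn = []
    for query in image_queries:
        scores = [
            scale * sum(q * k for q, k in zip(query, key)) for key in prompt_keys
        ]
        attn.append(softmax(scores))
    return attn


def pool_attention(attn):
    """Average the attention each prompt token receives over all positions."""
    n = len(attn)
    return [sum(row[j] for row in attn) / n for j in range(len(attn[0]))]


if __name__ == "__main__":
    # Hypothetical embeddings: three image positions, two prompt tokens.
    queries = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
    keys = [[1.0, 0.0], [0.0, 1.0]]
    attn = cross_attention_map(queries, keys)
    print(attn)
    print(pool_attention(attn))
```

In a learnable-prompt setup, the key vectors would come from trainable prompt embeddings, so training can shape which image regions each prompt token attends to.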

This prompt-guided training allows GenzIQA to be applied to a wide range of IQA tasks and datasets, going beyond the limitations of traditional supervised IQA models that are constrained to the specific data they are trained on. The authors' cross-database experiments, which span user-generated, synthetic, and low-light benchmark databases, show GenzIQA generalizing better than the methods it is compared against.
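This summary does not name the evaluation metric. IQA benchmarks commonly report the Spearman rank correlation coefficient (SRCC) between predicted scores and human opinion scores, and one of the related abstracts below quotes SRCC values. As a minimal sketch under that assumption, SRCC can be computed as the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The function names and sample scores are illustrative only.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def srcc(predicted, human_scores):
    """Spearman rank correlation between model predictions and human ratings."""
    return pearson(average_ranks(predicted), average_ranks(human_scores))


if __name__ == "__main__":
    predicted = [0.31, 0.55, 0.42, 0.90, 0.12]  # hypothetical model outputs
    human = [2.1, 3.4, 3.9, 4.8, 1.0]  # hypothetical mean opinion scores
    print(round(srcc(predicted, human), 3))
```

Because SRCC depends only on rank order, a model trained on one database can be scored on another whose opinion scores use a different scale, which is what makes it suitable for cross-database comparisons.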

Critical Analysis

A key strength of the GenzIQA approach is its flexibility and potential for generalization. By leveraging a pre-trained diffusion model and fine-tuning it with prompts, the model can learn a more versatile understanding of image quality that is not tied to a single dataset or set of quality attributes.

However, the paper does not extensively explore the model's ability to transfer to completely novel datasets or real-world applications beyond the standard IQA benchmarks. Further research is needed to fully demonstrate the generalization capabilities claimed by the authors.

Additionally, the computational cost and training complexity of the prompt-guided fine-tuning process are not discussed in detail. Deploying such a model in practical scenarios may require careful optimization and resource considerations.

Conclusion

The GenzIQA paper presents a novel approach to image quality assessment that aims to overcome the limitations of traditional supervised models. By incorporating prompt-guided fine-tuning of a pre-trained diffusion model, the researchers have developed a more flexible and generalizable IQA system.

This work has the potential to enable more robust and adaptable quality evaluation for a wide range of visual applications, from image enhancement to content moderation. However, further research is needed to fully validate the model's generalization capabilities and explore its practical deployment considerations.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models

Diptanu De, Shankhanil Mitra, Rajiv Soundararajan

The design of no-reference (NR) image quality assessment (IQA) algorithms is extremely important to benchmark and calibrate user experiences in modern visual systems. A major drawback of state-of-the-art NR-IQA methods is their limited ability to generalize across diverse IQA settings with reasonable distribution shifts. Recent text-to-image generative models such as latent diffusion models generate meaningful visual concepts with fine details related to text concepts. In this work, we leverage the denoising process of such diffusion models for generalized IQA by understanding the degree of alignment between learnable quality-aware text prompts and images. In particular, we learn cross-attention maps from intermediate layers of the denoiser of latent diffusion models to capture quality-aware representations of images. In addition, we also introduce learnable quality-aware text prompts that enable the cross-attention features to be better quality-aware. Our extensive cross database experiments across various user-generated, synthetic, and low-light content-based benchmarking databases show that latent diffusion models can achieve superior generalization in IQA when compared to other methods in the literature.

6/10/2024

DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild

Honghao Fu, Yufei Wang, Wenhan Yang, Bihan Wen

Blind image quality assessment (IQA) in the wild, which assesses the quality of images with complex authentic distortions and no reference images, presents significant challenges. Given the difficulty in collecting large-scale training data, leveraging limited data to develop a model with strong generalization remains an open problem. Motivated by the robust image perception capabilities of pre-trained text-to-image (T2I) diffusion models, we propose a novel IQA method, diffusion priors-based IQA (DP-IQA), to utilize the T2I model's prior for improved performance and generalization ability. Specifically, we utilize pre-trained Stable Diffusion as the backbone, extracting multi-level features from the denoising U-Net guided by prompt embeddings through a tunable text adapter. Simultaneously, an image adapter compensates for information loss introduced by the lossy pre-trained encoder. Unlike T2I models that require full image distribution modeling, our approach targets image quality assessment, which inherently requires fewer parameters. To improve applicability, we distill the knowledge into a lightweight CNN-based student model, significantly reducing parameters while maintaining or even enhancing generalization performance. Experimental results demonstrate that DP-IQA achieves state-of-the-art performance on various in-the-wild datasets, highlighting the superior generalization capability of T2I priors in blind IQA tasks. To our knowledge, DP-IQA is the first method to apply pre-trained diffusion priors in blind IQA. Codes and checkpoints are available at https://github.com/RomGai/DP-IQA.

8/20/2024

CLIP-Guided Attribute Aware Pretraining for Generalizable Image Quality Assessment

Daekyu Kwon, Dongyoung Kim, Sehwan Ki, Younghyun Jo, Hyong-Euk Lee, Seon Joo Kim

In no-reference image quality assessment (NR-IQA), the challenge of limited dataset sizes hampers the development of robust and generalizable models. Conventional methods address this issue by utilizing large datasets to extract rich representations for IQA. Also, some approaches propose vision language models (VLM) based IQA, but the domain gap between generic VLM and IQA constrains their scalability. In this work, we propose a novel pretraining framework that constructs a generalizable representation for IQA by selectively extracting quality-related knowledge from VLM and leveraging the scalability of large datasets. Specifically, we carefully select optimal text prompts for five representative image quality attributes and use VLM to generate pseudo-labels. Numerous attribute-aware pseudo-labels can be generated with large image datasets, allowing our IQA model to learn rich representations about image quality. Our approach achieves state-of-the-art performance on multiple IQA datasets and exhibits remarkable generalization capabilities. Leveraging these strengths, we propose several applications, such as evaluating image generation models and training image enhancement models, demonstrating our model's real-world applicability. We will make the code available for access.

6/4/2024

Multi-Modal Prompt Learning on Blind Image Quality Assessment

Wensheng Pan, Timin Gao, Yan Zhang, Runze Hu, Xiawu Zheng, Enwei Zhang, Yuting Gao, Yutao Liu, Yunhang Shen, Ke Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji

Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Currently, leveraging semantic information to enhance IQA is a crucial research direction. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. However, the generalist nature of these pre-trained Vision-Language (VL) models often renders them suboptimal for IQA-specific tasks. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. Existing prompt-based VL models overly focus on incremental semantic information from text, neglecting the rich insights available from visual data analysis. This imbalance limits their performance improvements in IQA tasks. This paper introduces an innovative multi-modal prompt-based methodology for IQA. Our approach employs carefully crafted prompts that synergistically mine incremental semantic information from both visual and linguistic data. Specifically, in the visual branch, we introduce a multi-layer prompt structure to enhance the VL model's adaptability. In the text branch, we deploy a dual-prompt scheme that steers the model to recognize and differentiate between scene category and distortion type, thereby refining the model's capacity to assess image quality. Our experimental findings underscore the effectiveness of our method over existing Blind Image Quality Assessment (BIQA) approaches. Notably, it demonstrates competitive performance across various datasets. Our method achieves Spearman Rank Correlation Coefficient (SRCC) values of 0.961 (surpassing 0.946 in CSIQ) and 0.941 (exceeding 0.930 in KADID), illustrating its robustness and accuracy in diverse contexts.

5/21/2024