Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment

2406.16641

Published 6/26/2024 by Jun Fu, Wei Zhou, Qiuping Jiang, Hantao Liu, Guangtao Zhai

Abstract

Recently, textual prompt tuning has shown inspirational performance in adapting Contrastive Language-Image Pre-training (CLIP) models to natural image quality assessment. However, such uni-modal prompt learning method only tunes the language branch of CLIP models. This is not enough for adapting CLIP models to AI generated image quality assessment (AGIQA) since AGIs visually differ from natural images. In addition, the consistency between AGIs and user input text prompts, which correlates with the perceptual quality of AGIs, is not investigated to guide AGIQA. In this letter, we propose vision-language consistency guided multi-modal prompt learning for blind AGIQA, dubbed CLIP-AGIQA. Specifically, we introduce learnable textual and visual prompts in language and vision branches of CLIP models, respectively. Moreover, we design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts. Experimental results on two public AGIQA datasets demonstrate that the proposed method outperforms state-of-the-art quality assessment models. The source code is available at https://github.com/JunFu1995/CLIP-AGIQA.

Overview

The paper proposes a multi-modal prompt learning approach for blind AI-generated image quality assessment (AGIQA)
It leverages vision-language consistency to guide the learning of prompts that can effectively evaluate the quality of AI-generated images
The approach aims to improve the performance of AGIQA models, which are crucial for ensuring the safety and reliability of AI-generated content

Plain English Explanation

Bringing Textual Prompt to AI-Generated Image Quality Assessment and Multi-Modal Prompt Learning on Blind Image Quality Assessment are two related papers that use prompts to help AI systems evaluate image quality, the first of them specifically for AI-generated images. Prompts are short phrases or sentences that provide instructions or guidance to an AI model.

The key idea of this paper is to add learnable prompts to both the text side and the image side of a CLIP model, and to guide their training with a measure of how well each AI-generated image matches the text prompt the user supplied. Because this image-text consistency correlates with how people perceive the quality of a generated image, it helps the AI system better assess the overall quality of the generated content.
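To make the notion of image-text consistency concrete: in CLIP-style models, how well an image matches a piece of text is commonly measured as the cosine similarity between their embeddings. The sketch below illustrates only that general idea. The four-dimensional vectors are made-up stand-ins for real CLIP embeddings, and nothing here is taken from the paper's code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings, invented for illustration (not real CLIP outputs).
image_embedding = [0.9, 0.1, 0.3, 0.0]
matching_prompt = [0.8, 0.2, 0.4, 0.1]   # text that describes the image
unrelated_prompt = [0.0, 0.9, 0.0, 0.6]  # text about something else

aligned = cosine_similarity(image_embedding, matching_prompt)
misaligned = cosine_similarity(image_embedding, unrelated_prompt)

print(f"matching prompt:  {aligned:.3f}")
print(f"unrelated prompt: {misaligned:.3f}")
```

A higher value means the image and the text point in a more similar direction in the shared embedding space, which is the kind of signal a text-to-image alignment score builds on.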

This is important because PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition and other research argue that an effective quality assessment framework is essential for AI-generated content. The RankCLIP: Ranking-Consistent Language-Image Pretraining and IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning approaches provide additional context on how prompts and multi-modal learning can be used with vision-language models.

Technical Explanation

The proposed method, dubbed CLIP-AGIQA, adapts a CLIP model to blind AGIQA through vision-language consistency guided multi-modal prompt learning. Based on the abstract, the approach consists of three key components:

Multi-Modal Prompt Learning: Learnable textual and visual prompts are introduced in the language and vision branches of CLIP, respectively. Earlier textual prompt tuning adapted only the language branch, which is not enough for AGIQA because AGIs visually differ from natural images.
Text-to-Image Alignment Quality Prediction: A separate task predicts how well an AGI matches the text prompt the user supplied. This consistency correlates with the perceptual quality of the AGI.
Consistency-Guided Prompt Optimization: The vision-language consistency knowledge learned by the alignment task is used to guide the optimization of the multi-modal prompts for blind quality prediction, that is, prediction without a pristine reference image. The abstract does not spell out the exact guidance mechanism.
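The idea of adding learnable prompts to both CLIP branches can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. The names `init_prompts` and `prepend_prompts`, the prompt count, and the embedding width are all hypothetical, and real methods may place the prompts at other positions in the sequence or at several layers. It also assumes the usual prompt-learning setup in which only the prompt vectors are trained while the pre-trained encoders stay frozen.

```python
import random

EMBED_DIM = 4    # toy embedding width (real CLIP models are far wider)
NUM_PROMPTS = 2  # hypothetical number of learnable prompt vectors per branch

def init_prompts(num_prompts, dim, seed):
    """Create small random prompt vectors; these would be the trainable parameters."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_prompts)]

def prepend_prompts(prompts, token_embeddings):
    """Return a new sequence: the learnable prompts followed by the original tokens."""
    return [list(p) for p in prompts] + [list(t) for t in token_embeddings]

# One set of prompts per branch (language and vision).
text_prompts = init_prompts(NUM_PROMPTS, EMBED_DIM, seed=0)
visual_prompts = init_prompts(NUM_PROMPTS, EMBED_DIM, seed=1)

# Stand-ins for embedded word tokens and embedded image patches.
word_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
patch_tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0]]

text_input = prepend_prompts(text_prompts, word_tokens)       # would go to the text encoder
visual_input = prepend_prompts(visual_prompts, patch_tokens)  # would go to the image encoder

print(len(text_input), len(visual_input))
```

The point of the design is that the number of trainable values stays small: in this toy setup only the two prompt sets would be updated, while everything learned during CLIP pre-training is reused as-is.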

The researchers evaluate their approach on two public AGIQA datasets and report that CLIP-AGIQA outperforms state-of-the-art quality assessment models.
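This summary does not say which metrics the comparison uses. Quality assessment models are commonly compared by how well their predicted scores correlate with human ratings, typically with the Spearman rank correlation coefficient (SRCC, which also appears in one of the related abstracts below) and the Pearson linear correlation coefficient (PLCC). The sketch below computes both on invented numbers. The scores are illustrative only and are not results from the paper.

```python
import math

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented example: model predictions vs. human mean opinion scores.
predicted = [0.2, 0.5, 0.4, 0.9, 0.7]
human_scores = [1.5, 2.9, 3.1, 4.8, 4.0]

srcc = spearman(predicted, human_scores)
plcc = pearson(predicted, human_scores)
print(f"SRCC = {srcc:.3f}, PLCC = {plcc:.3f}")
```

SRCC only cares whether the model orders images the same way people do, while PLCC also rewards a linear relationship between the predicted and human scores.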

Critical Analysis

The paper presents a well-designed and thorough approach to improving blind AGIQA using multi-modal prompt learning. The key strength of the method is its ability to leverage the consistency between visual and textual information to guide the prompt learning process, which is a novel and promising direction for this problem.

However, the paper does not address several potential limitations and areas for further research. For example, the effectiveness of the approach may be dependent on the quality and diversity of the training data, and it is unclear how the method would perform in scenarios with significant distribution shift or domain mismatch between the training and test data.

Additionally, the paper does not provide much discussion on the interpretability and transparency of the learned prompts. Understanding the reasoning behind the prompts could be important for building trust and ensuring the reliability of the AGIQA system, especially in safety-critical applications.

Overall, the research presented in this paper is a valuable contribution to the field of AGIQA, and the proposed CLIP-AGIQA approach shows promising results. However, further investigation into the limitations and potential extensions of the method would be beneficial to fully understand its practical implications and suitability for real-world deployment.

Conclusion

The paper introduces a novel multi-modal prompt learning approach, CLIP-AGIQA, for blind AI-generated image quality assessment (AGIQA). By leveraging the consistency between AI-generated images and the text prompts used to create them, the method learns textual and visual prompts that adapt a CLIP model to evaluating the quality of AI-generated images.

The proposed approach outperforms state-of-the-art quality assessment models on two public AGIQA datasets, demonstrating its potential to improve the reliability and safety of AI-generated content. Open questions remain for further investigation, such as the interpretability of the learned prompts and the method's performance under distribution shift.

Overall, this work represents an important step forward in the field of AGIQA and highlights the value of incorporating multi-modal information and prompt-based techniques to address the challenges of evaluating the quality of AI-generated images.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Bringing Textual Prompt to AI-Generated Image Quality Assessment

Bowen Qu, Haohui Li, Wei Gao

AI-Generated Images (AGIs) have inherent multimodal nature. Unlike traditional image quality assessment (IQA) on natural scenarios, AGIs quality assessment (AGIQA) takes the correspondence of image and its textual prompt into consideration. This is coupled in the ground truth score, which confuses the unimodal IQA methods. To solve this problem, we introduce IP-IQA (AGIs Quality Assessment via Image and Prompt), a multimodal framework for AGIQA via corresponding image and prompt incorporation. Specifically, we propose a novel incremental pretraining task named Image2Prompt for better understanding of AGIs and their corresponding textual prompts. An effective and efficient image-prompt fusion module, along with a novel special [QA] token, are also applied. Both are plug-and-play and beneficial for the cooperation of image and its corresponding prompt. Experiments demonstrate that our IP-IQA achieves the state-of-the-art on AGIQA-1k and AGIQA-3k datasets. Code will be available at https://github.com/Coobiw/IP-IQA.

5/22/2024

cs.CV cs.MM

Multi-Modal Prompt Learning on Blind Image Quality Assessment

Wensheng Pan, Timin Gao, Yan Zhang, Runze Hu, Xiawu Zheng, Enwei Zhang, Yuting Gao, Yutao Liu, Yunhang Shen, Ke Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji

Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Currently, leveraging semantic information to enhance IQA is a crucial research direction. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. However, the generalist nature of these pre-trained Vision-Language (VL) models often renders them suboptimal for IQA-specific tasks. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. Existing prompt-based VL models overly focus on incremental semantic information from text, neglecting the rich insights available from visual data analysis. This imbalance limits their performance improvements in IQA tasks. This paper introduces an innovative multi-modal prompt-based methodology for IQA. Our approach employs carefully crafted prompts that synergistically mine incremental semantic information from both visual and linguistic data. Specifically, in the visual branch, we introduce a multi-layer prompt structure to enhance the VL model's adaptability. In the text branch, we deploy a dual-prompt scheme that steers the model to recognize and differentiate between scene category and distortion type, thereby refining the model's capacity to assess image quality. Our experimental findings underscore the effectiveness of our method over existing Blind Image Quality Assessment (BIQA) approaches. Notably, it demonstrates competitive performance across various datasets. Our method achieves Spearman Rank Correlation Coefficient (SRCC) values of 0.961 (surpassing 0.946 in CSIQ) and 0.941 (exceeding 0.930 in KADID), illustrating its robustness and accuracy in diverse contexts.

5/21/2024

cs.CV

PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition

Xi Fang, Weigang Wang, Xiaoxin Lv, Jun Yan

The development of Large Language Models (LLM) and Diffusion Models brings the boom of Artificial Intelligence Generated Content (AIGC). It is essential to build an effective quality assessment framework to provide a quantifiable evaluation of different images or videos based on the AIGC technologies. The content generated by AIGC methods is driven by the crafted prompts. Therefore, it is intuitive that the prompts can also serve as the foundation of the AIGC quality assessment. This study proposes an effective AIGC quality assessment (QA) framework. First, we propose a hybrid prompt encoding method based on a dual-source CLIP (Contrastive Language-Image Pre-Training) text encoder to understand and respond to the prompt conditions. Second, we propose an ensemble-based feature mixer module to effectively blend the adapted prompt and vision features. The empirical study practices in two datasets: AIGIQA-20K (AI-Generated Image Quality Assessment database) and T2VQA-DB (Text-to-Video Quality Assessment DataBase), which validates the effectiveness of our proposed method: Prompt Condition Quality Assessment (PCQA). Our proposed simple and feasible framework may promote research development in the multimodal generation field.

4/23/2024

cs.CV

RankCLIP: Ranking-Consistent Language-Image Pretraining

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun

Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.

6/21/2024

cs.CV cs.AI cs.LG