Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

Read original: arXiv:2405.19298 - Published 5/30/2024 by Hanwei Zhu, Haoning Wu, Yixuan Li, Zicheng Zhang, Baoliang Chen, Lingyu Zhu, Yuming Fang, Guangtao Zhai, Weisi Lin, Shiqi Wang

Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

Overview

This research paper proposes a novel approach to image quality assessment (IQA) that leverages the capabilities of large pre-trained multimodal models.
The key idea is to "teach" these models to compare image quality by fine-tuning them on human-labeled image quality datasets.
The resulting model can then be used to adaptively assess the quality of images, outperforming traditional IQA methods.

Plain English Explanation

The paper explores a new way to evaluate the quality of images using powerful artificial intelligence (AI) models. Traditional IQA methods often struggle to capture the nuanced factors that humans use to judge image quality. This research aims to overcome this by teaching large AI models, which are already skilled at understanding images and language, to directly assess and compare image quality.

The researchers fine-tune these pre-trained models on datasets where humans have labeled the quality of various images. By learning from this human feedback, the models develop a more sophisticated understanding of what makes an image high or low quality. Once trained, the models can then be used to adaptively evaluate the quality of new images, potentially outperforming existing IQA approaches.

This adaptive IQA system could have important applications in areas like digital image processing, computational photography, and computer vision. By providing more nuanced and reliable quality assessments, it could help optimize image capture and editing workflows, as well as enable new applications like intelligent photo curation and multi-view image evaluation.

Technical Explanation

The core of this work is a novel IQA framework that leverages the power of large pre-trained multimodal models. The researchers hypothesize that these models, which have been trained on vast amounts of image and text data, can learn to assess image quality more effectively than traditional IQA methods.

The approach involves fine-tuning a pre-trained model, such as CLIP or Multimodal-T5, on human-labeled image quality datasets. This fine-tuning process teaches the model to directly compare and evaluate the quality of images, rather than relying on hand-crafted quality metrics.

The researchers conduct experiments on several standard IQA benchmarks, demonstrating that their fine-tuned models outperform state-of-the-art IQA algorithms across a range of image types and distortions. They also show that the models can adapt to different quality assessment tasks, such as predicting human mean opinion scores or ranking images by quality.

Critical Analysis

The proposed adaptive IQA framework is a promising approach that leverages the impressive capabilities of large multimodal models. By fine-tuning these models on human-labeled data, the researchers are able to imbue them with a more nuanced understanding of image quality that goes beyond traditional metrics.

However, the paper does not address some potential limitations of this approach. For example, the fine-tuning process may be data-intensive, requiring large and diverse image quality datasets to achieve good performance. Additionally, the adaptability of the models to different quality assessment tasks, while demonstrated, may be constrained by the specific datasets and fine-tuning approaches used.

Furthermore, the paper does not delve into the interpretability of the models' quality assessments. Understanding the factors and reasoning behind the models' judgments could be important for practical applications, where transparency and explainability may be crucial.

Despite these potential caveats, the overall approach represents an exciting step forward in the field of IQA, leveraging the power of large multimodal models to develop more sophisticated and adaptive quality assessment systems.

Conclusion

This research paper presents a novel framework for adaptive image quality assessment that harnesses the capabilities of large pre-trained multimodal models. By fine-tuning these models on human-labeled data, the researchers are able to imbue them with a more nuanced understanding of image quality, allowing them to outperform traditional IQA methods.

The adaptive nature of this approach holds significant promise for a range of applications, from digital image processing and computational photography to intelligent photo curation and multi-view image evaluation. As the field of AI continues to advance, the integration of powerful multimodal models into IQA systems may become an increasingly important strategy for improving image quality assessment and enabling new visual applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

Hanwei Zhu, Haoning Wu, Yixuan Li, Zicheng Zhang, Baoliang Chen, Lingyu Zhu, Yuming Fang, Guangtao Zhai, Weisi Lin, Shiqi Wang

While recent advancements in large multimodal models (LMMs) have significantly improved their abilities in image quality assessment (IQA) relying on absolute quality rating, how to transfer reliable relative quality comparison outputs to continuous perceptual quality scores remains largely unexplored. To address this gap, we introduce Compare2Score-an all-around LMM-based no-reference IQA (NR-IQA) model, which is capable of producing qualitatively comparative responses and effectively translating these discrete comparative levels into a continuous quality score. Specifically, during training, we present to generate scaled-up comparative instructions by comparing images from the same IQA dataset, allowing for more flexible integration of diverse IQA datasets. Utilizing the established large-scale training corpus, we develop a human-like visual quality comparator. During inference, moving beyond binary choices, we propose a soft comparison method that calculates the likelihood of the test image being preferred over multiple predefined anchor images. The quality score is further optimized by maximum a posteriori estimation with the resulting probability matrix. Extensive experiments on nine IQA datasets validate that the Compare2Score effectively bridges text-defined comparative levels during training with converted single image quality score for inference, surpassing state-of-the-art IQA models across diverse scenarios. Moreover, we verify that the probability-matrix-based inference conversion not only improves the rating accuracy of Compare2Score but also zero-shot general-purpose LMMs, suggesting its intrinsic effectiveness.

5/30/2024

Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, Chao Dong

We introduce a Depicted image Quality Assessment method (DepictQA), overcoming the constraints of traditional score-based methods. DepictQA allows for detailed, language-based, human-like evaluation of image quality by leveraging Multi-modal Large Language Models (MLLMs). Unlike conventional Image Quality Assessment (IQA) methods relying on scores, DepictQA interprets image content and distortions descriptively and comparatively, aligning closely with humans' reasoning process. To build the DepictQA model, we establish a hierarchical task framework, and collect a multi-modal IQA training dataset. To tackle the challenges of limited training data and multi-image processing, we propose to use multi-source training data and specialized image tags. These designs result in a better performance of DepictQA than score-based approaches on multiple benchmarks. Moreover, compared with general MLLMs, DepictQA can generate more accurate reasoning descriptive languages. We also demonstrate that our full-reference dataset can be extended to non-reference applications. These results showcase the research potential of multi-modal IQA methods. Codes and datasets are available in https://depictqa.github.io.

7/16/2024

A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, Lei Zhang

While Multimodal Large Language Models (MLLMs) have experienced significant advancement in visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. We first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., the standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with the respective optimal prompting systems. We assess three open-source and one closed-source MLLMs on several visual attributes of image quality (e.g., structural and textural distortions, geometric transformations, and color differences) in both full-reference and no-reference scenarios. Experimental results show that only the closed-source GPT-4V provides a reasonable account for human perception of image quality, but is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing visual quality of multiple images, tasks humans can perform effortlessly.

7/12/2024

Descriptive Image Quality Assessment in the Wild

Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Chao Dong, Tianfan Xue

With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce Depicted image Quality Assessment in the Wild (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named DQ-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that is helpful to filter out low-quality responses. Experimental results demonstrate that DepictQA-Wild significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications including assessing the web-downloaded images and ranking model-processed images. Datasets and codes will be released in https://depictqa.github.io/depictqa-wild/.

6/13/2024