Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Read original: arXiv:2312.08962 - Published 7/16/2024 by Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, Chao Dong

Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Overview

This paper introduces a new task called DepictQA, which aims to advance image quality assessment through the use of multi-modal language models.
The researchers create a dataset of images and their corresponding natural language descriptions, and then use this data to train a multi-modal model to assess image quality.
The key idea is to go beyond traditional score-based image quality assessment by incorporating the rich language-based information that humans use to describe and evaluate images.

Plain English Explanation

The researchers in this paper are trying to improve how we evaluate the quality of images. Traditionally, image quality has been assessed using numerical scores, but the researchers argue that this approach is limited because it doesn't capture the nuanced, language-based way that humans actually describe and judge image quality.

To address this, the researchers developed a new task called DepictQA, which stands for "Depicting Beyond Scores." The idea is to use multi-modal language models that can understand both images and natural language to assess image quality. The researchers created a dataset of images and their corresponding human-written descriptions, and then trained a model to use this data to evaluate image quality.

The key advantage of this approach is that it allows the model to learn the nuanced, language-based ways that humans describe and judge images, going beyond just numerical scores. For example, a human might describe an image as "vibrant," "blurry," or "lacking in contrast," and the model can learn to understand and apply these types of qualitative assessments.

Overall, the researchers believe that this multi-modal, language-based approach to image quality assessment could lead to more accurate and meaningful evaluations of image quality, with applications in areas like photography, digital media, and computer vision.

Technical Explanation

The key technical contributions of this paper are:

DepictQA Task and Dataset: The researchers create a new task and dataset for image quality assessment, called DepictQA. The dataset consists of images paired with human-written natural language descriptions that evaluate the quality of the images.
Multi-modal Language Model: The researchers use a multi-modal language model to tackle the DepictQA task. This model is trained to jointly understand and reason about both the image and the accompanying language-based quality descriptions.
Evaluation Metrics: The researchers introduce new evaluation metrics that go beyond traditional numerical image quality scores, incorporating language-based assessments of image quality.
Experiments and Insights: The researchers conduct extensive experiments to evaluate their multi-modal language model on the DepictQA task, and provide insights into the strengths and limitations of this approach compared to existing image quality assessment methods.

Critical Analysis

One key limitation of this research, as acknowledged by the authors, is the relatively small size of the DepictQA dataset. While the dataset provides a valuable starting point, the researchers note that expanding the dataset with more diverse images and language descriptions could lead to further improvements in the model's performance and generalization.

Additionally, the researchers mention that their current evaluation metrics, while capturing language-based quality assessments, may not fully reflect the subjective and contextual nature of human perception of image quality. Exploring alternative evaluation approaches, perhaps drawing from adaptive image quality assessment techniques or multi-modal prompt learning, could be a fruitful area for future research.

Overall, this paper represents an important step towards advancing image quality assessment by incorporating the rich, language-based ways that humans describe and evaluate images. The researchers have laid the groundwork for an exciting new direction in this field, and their work opens up numerous avenues for further exploration and refinement.

Conclusion

In this paper, the researchers introduce a novel approach to image quality assessment that goes beyond traditional numerical scoring methods. By leveraging multi-modal language models trained on a dataset of images and their corresponding natural language descriptions, the researchers demonstrate the potential to capture the nuanced, qualitative ways that humans perceive and evaluate image quality.

The DepictQA task and dataset, along with the researchers' multi-modal language model and evaluation metrics, represent a significant advancement in the field of image quality assessment. This work lays the foundation for future research that can further explore the intersection of language, perception, and image quality, with potential applications in areas such as photography, digital media, and computer vision.

Overall, this paper represents an important step forward in our understanding of how to better assess and depict the quality of images, moving beyond simplistic numerical scores and towards a more holistic, language-based approach that aligns with human intuition and perception.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, Chao Dong

We introduce a Depicted image Quality Assessment method (DepictQA), overcoming the constraints of traditional score-based methods. DepictQA allows for detailed, language-based, human-like evaluation of image quality by leveraging Multi-modal Large Language Models (MLLMs). Unlike conventional Image Quality Assessment (IQA) methods relying on scores, DepictQA interprets image content and distortions descriptively and comparatively, aligning closely with humans' reasoning process. To build the DepictQA model, we establish a hierarchical task framework, and collect a multi-modal IQA training dataset. To tackle the challenges of limited training data and multi-image processing, we propose to use multi-source training data and specialized image tags. These designs result in a better performance of DepictQA than score-based approaches on multiple benchmarks. Moreover, compared with general MLLMs, DepictQA can generate more accurate reasoning descriptive languages. We also demonstrate that our full-reference dataset can be extended to non-reference applications. These results showcase the research potential of multi-modal IQA methods. Codes and datasets are available in https://depictqa.github.io.

7/16/2024

Descriptive Image Quality Assessment in the Wild

Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Chao Dong, Tianfan Xue

With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce Depicted image Quality Assessment in the Wild (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named DQ-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that is helpful to filter out low-quality responses. Experimental results demonstrate that DepictQA-Wild significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications including assessing the web-downloaded images and ranking model-processed images. Datasets and codes will be released in https://depictqa.github.io/depictqa-wild/.

6/13/2024

A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, Lei Zhang

While Multimodal Large Language Models (MLLMs) have experienced significant advancement in visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. We first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., the standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with the respective optimal prompting systems. We assess three open-source and one closed-source MLLMs on several visual attributes of image quality (e.g., structural and textural distortions, geometric transformations, and color differences) in both full-reference and no-reference scenarios. Experimental results show that only the closed-source GPT-4V provides a reasonable account for human perception of image quality, but is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing visual quality of multiple images, tasks humans can perform effortlessly.

7/12/2024

UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment

Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Runze Hu, Xiu Li

Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Existing methods typically address these tasks independently due to distinct learning objectives. However, they neglect the underlying interconnectedness of both tasks, which hinders the learning of task-agnostic shared representations for human subjective perception. To confront this challenge, we propose Unified vision-language pre-training of Quality and Aesthetics (UniQA), to learn general perceptions of two tasks, thereby benefiting them simultaneously. Addressing the absence of text in the IQA datasets and the presence of textual noise in the IAA datasets, (1) we utilize multimodal large language models (MLLMs) to generate high-quality text descriptions; (2) the generated text for IAA serves as metadata to purify noisy IAA data. To effectively adapt the pre-trained UniQA to downstream tasks, we further propose a lightweight adapter that utilizes versatile cues to fully exploit the extensive knowledge of the pre-trained model. Extensive experiments demonstrate that our approach attains a new state-of-the-art performance on both IQA and IAA tasks, while concurrently showcasing exceptional zero-shot and few-label image assessment capabilities. The source code will be available at https://github.com/zht8506/UniQA.

6/4/2024