Descriptive Image Quality Assessment in the Wild

Read original: arXiv:2405.18842 - Published 6/13/2024 by Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Chao Dong, Tianfan Xue

Overview

This paper introduces a task called Descriptive Image Quality Assessment (DIQA) and a corresponding dataset for training and evaluating models on it (the abstract names the authors' method DepictQA-Wild).
DIQA goes beyond traditional image quality assessment: a model not only scores the quality of an image but also explains the assessed quality in natural language.
The authors construct a large-scale dataset, named DQ-495K in the abstract and scaled to 495K samples through a ground-truth-informed construction approach, and propose a multi-task deep learning model that both predicts quality scores and generates quality-related text.

Plain English Explanation

The paper is about a new way to assess the quality of images that goes beyond a numerical score. The researchers built a large dataset (495K samples, according to the paper's abstract) in which images are paired with descriptions explaining why their quality is good or bad.

For example, someone might look at an image and say "The lighting is too bright, making the subject look washed out." Or "The focus is a bit soft, so the details are not as sharp as they could be." The goal is to have AI models not just rate the quality with a number, but also generate these kinds of natural language descriptions explaining the quality assessment.

This could be useful in applications such as photo editing software, social media, and content creation, where knowing the specific cause of a quality problem helps users improve their photos. A description of what is wrong is more informative than a single generic quality score.

Technical Explanation

The paper introduces a new task called Descriptive Image Quality Assessment (DIQA), which aims to go beyond traditional numerical image quality assessment (IQA) by also generating natural language descriptions to explain the quality assessment.
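For contrast, it helps to see what a traditional score-only metric produces. The sketch below computes PSNR (peak signal-to-noise ratio), a standard full-reference metric chosen here purely as an illustration; it is not taken from the paper, and the function name and the list-of-rows image format are assumptions made for this example.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two grayscale images.

    Each image is a list of rows of pixel intensities (illustrative format).
    """
    same_shape = len(reference) == len(distorted) and all(
        len(ref_row) == len(dist_row)
        for ref_row, dist_row in zip(reference, distorted)
    )
    if not same_shape:
        raise ValueError("images must have the same shape")

    pixel_count = sum(len(row) for row in reference)
    if pixel_count == 0:
        raise ValueError("images must not be empty")

    # Mean squared error over all pixels.
    squared_error = sum(
        (r - d) ** 2
        for ref_row, dist_row in zip(reference, distorted)
        for r, d in zip(ref_row, dist_row)
    )
    mse = squared_error / pixel_count

    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([[100, 100]], [[110, 90]]))
```

The result is a single number (about 28.1 dB for the two-pixel example), which says nothing about whether the degradation is blur, noise, or overexposure. That missing explanation is what the descriptive task is meant to supply.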

To enable research on this task, the authors construct a large-scale DIQA dataset, which the abstract names DQ-495K and sizes at 495K samples, built with a ground-truth-informed construction approach. They propose a multi-task deep learning model trained both to predict numerical quality scores and to generate the corresponding quality-related text descriptions.
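The paper's abstract (reproduced under Related Papers) describes a task paradigm covering assessment and comparison tasks, brief and detailed responses, and full-reference and non-reference scenarios. A minimal sketch of how those three axes combine is shown below. All class and method names are hypothetical, and the image-count rule is an assumption for illustration rather than something stated in the source.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Task(Enum):
    ASSESSMENT = "assessment"
    COMPARISON = "comparison"


class ResponseStyle(Enum):
    BRIEF = "brief"
    DETAILED = "detailed"


class ReferenceMode(Enum):
    FULL_REFERENCE = "full-reference"
    NON_REFERENCE = "non-reference"


@dataclass(frozen=True)
class TaskSetting:
    """One combination of the three axes named in the abstract."""

    task: Task
    style: ResponseStyle
    reference: ReferenceMode

    def num_input_images(self) -> int:
        # Assumption: assessment looks at one image, comparison at two,
        # and a full-reference setting adds one pristine reference image.
        evaluated = 1 if self.task is Task.ASSESSMENT else 2
        extra = 1 if self.reference is ReferenceMode.FULL_REFERENCE else 0
        return evaluated + extra


# Every combination of task, response style, and reference mode.
ALL_SETTINGS = [
    TaskSetting(task, style, reference)
    for task, style, reference in product(Task, ResponseStyle, ReferenceMode)
]

if __name__ == "__main__":
    for setting in ALL_SETTINGS:
        print(setting.task.value, setting.style.value,
              setting.reference.value, setting.num_input_images())
```

Enumerating the settings explicitly makes it easy to check that a dataset or evaluation script covers each of the eight combinations.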

The model architecture consists of a shared encoder backbone that extracts visual features, coupled with separate task-specific heads for quality regression and text generation. The authors experiment with different text generation approaches, including retrieval-based and generative methods.
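The shared-encoder, two-head layout described in this summary can be illustrated with a deliberately tiny toy. This is not the authors' model: the hand-written features, the weights, the description bank, and all function names are assumptions chosen so the example runs with the Python standard library alone. The text head uses the retrieval-based variant, returning the stored description whose feature vector lies closest to the image's features.

```python
import math
from statistics import mean, pstdev


def encode(image):
    """Shared 'encoder': map a grayscale image (rows of 0-255 values)
    to two features, mean brightness and contrast, both in [0, 1]."""
    pixels = [p / 255.0 for row in image for p in row]
    return [mean(pixels), pstdev(pixels)]


def score_head(features, weights=(0.0, 2.0), bias=0.0):
    """Regression head: linear map from features to a scalar score.
    The weights are hypothetical; a real model would learn them."""
    return sum(w * f for w, f in zip(weights, features)) + bias


# Hypothetical bank of (feature vector, description) pairs.
DESCRIPTION_BANK = [
    ([0.95, 0.03], "The image is overexposed and the subject looks washed out."),
    ([0.05, 0.03], "The image is underexposed and details are lost in shadow."),
    ([0.50, 0.30], "Exposure and contrast look balanced."),
]


def text_head(features, bank=DESCRIPTION_BANK):
    """Retrieval head: return the description whose stored features
    are nearest (Euclidean distance) to the input features."""
    return min(bank, key=lambda item: math.dist(item[0], features))[1]


def assess(image):
    """Run both heads on features computed once by the shared encoder."""
    features = encode(image)
    return score_head(features), text_head(features)


if __name__ == "__main__":
    print(assess([[250, 250], [250, 250]]))
```

The point of the layout is that the features are computed once and reused by both heads, so the score and the description are derived from the same representation. A generative text head would replace the nearest-neighbour lookup with a language decoder conditioned on those features.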

Experiments show that the proposed model learns not only to assess image quality but also to explain the assessment in natural language. According to the abstract, it outperforms traditional score-based methods, prior VLM-based IQA models, and the proprietary GPT-4V on distortion identification, instant rating, and reasoning tasks. This goes beyond what traditional IQA metrics offer and has promising applications in areas like photo editing, content curation, and user feedback.

Critical Analysis

The DIQA task and dataset introduced in this paper represent an important step forward in image quality assessment. By incorporating natural language descriptions, the approach provides much richer and more informative quality feedback compared to just a numerical score.

However, the authors acknowledge some limitations of their work. The dataset, while large, may not fully capture the diversity of real-world image quality issues. The proposed model also has room for improvement in terms of the quality and coherence of the generated text descriptions.

Additionally, the paper does not explore the potential biases or limitations of the human-provided quality descriptions in the dataset. There may be subjective or cultural factors influencing the way people assess and describe image quality that could impact the model's performance and generalization.

Further research is needed to address these challenges, such as exploring ways to collect more diverse and representative quality descriptions, and investigating techniques to improve the faithfulness and fluency of the generated text. Evaluating the practical utility of DIQA models in real-world applications would also be an important next step.

Overall, this work represents an exciting advance in the field of image quality assessment, with the potential to enable more intelligent and user-friendly tools for media creation and curation. As the research in this area progresses, it will be important to carefully consider the societal implications and ethical considerations surrounding these technologies.

Conclusion

This paper introduces a new task called Descriptive Image Quality Assessment (DIQA) and a corresponding large-scale dataset. DIQA goes beyond traditional image quality assessment by having AI models not just score the quality of an image, but also generate natural language descriptions explaining the reasons for the quality assessment.

The authors demonstrate that their proposed multi-task deep learning model can effectively learn to both predict quality scores and produce helpful quality-related text descriptions. This represents an important advance in the field, with potential applications in areas like photo editing, content curation, and user feedback.

While the work has some limitations, it lays the groundwork for further research into more informative and user-friendly image quality assessment technologies. As these models continue to evolve, it will be crucial to carefully consider the societal implications and ensure the ethical development of these capabilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Descriptive Image Quality Assessment in the Wild

Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Chao Dong, Tianfan Xue

With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce Depicted image Quality Assessment in the Wild (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named DQ-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that is helpful to filter out low-quality responses. Experimental results demonstrate that DepictQA-Wild significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications including assessing the web-downloaded images and ranking model-processed images. Datasets and codes will be released in https://depictqa.github.io/depictqa-wild/.

6/13/2024

Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, Chao Dong

We introduce a Depicted image Quality Assessment method (DepictQA), overcoming the constraints of traditional score-based methods. DepictQA allows for detailed, language-based, human-like evaluation of image quality by leveraging Multi-modal Large Language Models (MLLMs). Unlike conventional Image Quality Assessment (IQA) methods relying on scores, DepictQA interprets image content and distortions descriptively and comparatively, aligning closely with humans' reasoning process. To build the DepictQA model, we establish a hierarchical task framework, and collect a multi-modal IQA training dataset. To tackle the challenges of limited training data and multi-image processing, we propose to use multi-source training data and specialized image tags. These designs result in a better performance of DepictQA than score-based approaches on multiple benchmarks. Moreover, compared with general MLLMs, DepictQA can generate more accurate reasoning descriptive languages. We also demonstrate that our full-reference dataset can be extended to non-reference applications. These results showcase the research potential of multi-modal IQA methods. Codes and datasets are available in https://depictqa.github.io.

7/16/2024

UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment

Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Runze Hu, Xiu Li

Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Existing methods typically address these tasks independently due to distinct learning objectives. However, they neglect the underlying interconnectedness of both tasks, which hinders the learning of task-agnostic shared representations for human subjective perception. To confront this challenge, we propose Unified vision-language pre-training of Quality and Aesthetics (UniQA), to learn general perceptions of two tasks, thereby benefiting them simultaneously. Addressing the absence of text in the IQA datasets and the presence of textual noise in the IAA datasets, (1) we utilize multimodal large language models (MLLMs) to generate high-quality text descriptions; (2) the generated text for IAA serves as metadata to purify noisy IAA data. To effectively adapt the pre-trained UniQA to downstream tasks, we further propose a lightweight adapter that utilizes versatile cues to fully exploit the extensive knowledge of the pre-trained model. Extensive experiments demonstrate that our approach attains a new state-of-the-art performance on both IQA and IAA tasks, while concurrently showcasing exceptional zero-shot and few-label image assessment capabilities. The source code will be available at https://github.com/zht8506/UniQA.

6/4/2024

Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment

Fei Zhou, Zhicong Huang, Tianhao Gu, Guoping Qiu

The visual quality of an image is confounded by a number of intertwined factors including its semantic content, distortion characteristics and appearance properties such as brightness, contrast, sharpness, and colourfulness. Distilling high level knowledge about all these quality bearing attributes is crucial for developing objective Image Quality Assessment (IQA). While existing solutions have modeled some of these aspects, a comprehensive solution that involves all these important quality related attributes has not yet been developed. In this paper, we present a new blind IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image QUality Evaluator (SLIQUE) that features a joint vision-language and visual contrastive representation learning framework for acquiring high level knowledge about the images' semantic contents, distortion characteristics and appearance properties for IQA. For training SLIQUE, we have developed a systematic approach to constructing a first of its kind large image database annotated with all three categories of quality relevant texts. The Text Annotated Distortion, Appearance and Content (TADAC) database has over 1.6 million images annotated with textual descriptions of their semantic contents, distortion characteristics and appearance properties. The method for constructing TADAC and the database itself will be particularly useful for exploiting vision-language modeling for advanced IQA applications. Extensive experimental results show that SLIQUE has superior performances over state of the art, demonstrating the soundness of its design principle and the effectiveness of its implementation.

6/24/2024