UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment

Read original: arXiv:2406.01069 - Published 6/4/2024 by Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Runze Hu, Xiu Li

Overview

The paper proposes UniQA, a unified vision-language pre-training model for image quality assessment (IQA) and image aesthetic assessment (IAA).
UniQA pre-trains on image-text data for both tasks to learn a shared representation, which a lightweight adapter then transfers to downstream assessment tasks.
The authors report state-of-the-art results on both IQA and IAA benchmarks, along with strong zero-shot and few-label performance.

Plain English Explanation

The researchers have developed an AI model called UniQA that evaluates both the technical quality and the aesthetic appeal of images. Earlier methods typically treated these as two separate problems, each with its own model. UniQA is instead pre-trained on images paired with text that describes their quality and aesthetics, so it learns a shared understanding of how humans perceive and judge images.

The key idea is to build on vision-language models, which are trained on large collections of image-text pairs. By learning the associations between visual features and the words people use to describe them, UniQA picks up what makes an image aesthetically pleasing or high in quality. That knowledge then transfers to downstream tasks such as scoring the quality of photos or rating the aesthetic appeal of art and design.

Compared to prior approaches, the researchers show that UniQA achieves state-of-the-art performance on several standard benchmarks for both image quality assessment and aesthetic assessment. This supports the case for their unified, multimodal training approach.

Technical Explanation

The core innovation of UniQA is its use of large-scale vision-language pre-training to learn a joint representation for image quality and aesthetic assessment. The model architecture builds on the successful CLIP framework, which aligns image and text embeddings in a shared latent space.
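To make the idea of aligning image and text embeddings concrete, here is a minimal sketch of the symmetric contrastive loss that CLIP-style training uses. It is written in plain Python for readability and is not taken from the UniQA code. The function names are illustrative, the embeddings are assumed to come from some image and text encoder, and the temperature of 0.07 is CLIP's usual starting value, not a setting confirmed for UniQA.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch where image i matches text i.

    `temperature` is an assumed value (CLIP's common initialization).
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(images[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: each image should select its own caption.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: each caption should select its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (image_to_text + text_to_image) / 2
```

The loss is small when each image is most similar to its own description and large when the pairs are mismatched, which is what pulls matching images and texts together in the shared space.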

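UniQA adapts CLIP-style pre-training to assessment data. Because IQA datasets lack text and IAA datasets contain noisy text, the authors use multimodal large language models (MLLMs) to generate high-quality text descriptions, and they use the generated IAA text as metadata to purify the noisy IAA data. Pre-training on the resulting image-text pairs gives the model a shared representation of the factors that drive both quality and aesthetic judgments. A lightweight adapter then transfers the pre-trained model to downstream tasks.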
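The paper's abstract describes using MLLM-generated text as metadata to purify noisy IAA data. This summary does not spell out the filtering criterion, so the sketch below is only one plausible reading: rank the existing comments for an image by their similarity to the generated description and keep the closest ones. The Jaccard word-overlap score, the `top_k` parameter, and all function names are assumptions chosen to keep the example self-contained; a real pipeline would more likely compare learned text embeddings.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of words (commas and periods dropped)."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return set(cleaned.split())


def jaccard(text_a, text_b):
    """Word-overlap similarity in [0, 1]; a stand-in for a learned text similarity."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def purify_comments(generated_caption, comments, top_k=2):
    """Keep the `top_k` comments most similar to the generated caption.

    Hypothetical helper: the selection rule is an assumption, not the paper's.
    """
    ranked = sorted(
        comments,
        key=lambda comment: jaccard(generated_caption, comment),
        reverse=True,
    )
    return ranked[:top_k]


# Illustrative data only: one generated description and four raw comments.
caption = "soft lighting and balanced composition with vivid colors"
raw_comments = [
    "love the soft lighting and composition",
    "first!!",
    "vivid colors, balanced composition",
    "check my profile",
]
kept = purify_comments(caption, raw_comments, top_k=2)
```

In this toy example the two comments that actually discuss the image rank above the off-topic ones, which is the behavior a purification step is meant to achieve.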

The researchers evaluate UniQA on benchmarks for both image quality assessment and image aesthetic assessment. They report that it outperforms prior state-of-the-art methods on both tasks and also performs well in zero-shot and few-label settings, which supports the unified, multimodal training approach.
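This summary does not list the metrics used, but IQA and IAA benchmarks are conventionally scored by how well predicted scores correlate with human opinion scores, using the Spearman rank correlation (SRCC) and the Pearson linear correlation (PLCC). The sketch below computes both in plain Python under that assumption. The function names and the sample scores are illustrative and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient (PLCC) between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up example: model predictions versus human mean opinion scores.
predicted = [0.2, 0.5, 0.9, 0.4]
human_scores = [1.0, 3.0, 4.5, 2.0]
srcc = spearman(predicted, human_scores)
plcc = pearson(predicted, human_scores)
```

SRCC rewards getting the ordering of images right regardless of scale, while PLCC also rewards a linear relationship between predicted and human scores, so benchmark tables usually report both.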

Critical Analysis

The paper presents a compelling approach to leveraging vision-language pre-training for image quality and aesthetic assessment. By incorporating a broad range of visual and textual data, UniQA is able to learn rich representations that capture the nuanced factors that influence human judgments of images.

However, the authors acknowledge some limitations of their work. For example, the pre-training data is primarily in English, so the model's performance may be less robust for non-English cultural contexts. Additionally, the evaluation is focused on standard benchmark datasets, so further research is needed to understand how UniQA would perform in real-world, unconstrained settings.

It would also be valuable to explore ways of making the model's decision-making more transparent and interpretable. Understanding the specific visual and linguistic cues that the model uses to assess image quality and aesthetics could provide important insights for both the technical development and practical deployment of such systems.

Conclusion

Overall, the UniQA model represents an exciting advancement in the field of multi-modal machine learning for image understanding. By combining vision and language, the researchers have developed a powerful tool for evaluating the quality and aesthetic appeal of images. This could have broad applications in domains like photography, art, design, and user-generated content moderation.

As AI systems become increasingly capable of assessing visual aesthetics, it will be important to consider the social implications and potential biases of such technologies. Ongoing research and responsible development will be crucial to ensure that these tools are deployed ethically and equitably. Nonetheless, the UniQA paper demonstrates the significant potential of unified vision-language models to advance our understanding and appreciation of the visual world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment

Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Runze Hu, Xiu Li

Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Existing methods typically address these tasks independently due to distinct learning objectives. However, they neglect the underlying interconnectedness of both tasks, which hinders the learning of task-agnostic shared representations for human subjective perception. To confront this challenge, we propose Unified vision-language pre-training of Quality and Aesthetics (UniQA), to learn general perceptions of two tasks, thereby benefiting them simultaneously. Addressing the absence of text in the IQA datasets and the presence of textual noise in the IAA datasets, (1) we utilize multimodal large language models (MLLMs) to generate high-quality text descriptions; (2) the generated text for IAA serves as metadata to purify noisy IAA data. To effectively adapt the pre-trained UniQA to downstream tasks, we further propose a lightweight adapter that utilizes versatile cues to fully exploit the extensive knowledge of the pre-trained model. Extensive experiments demonstrate that our approach attains a new state-of-the-art performance on both IQA and IAA tasks, while concurrently showcasing exceptional zero-shot and few-label image assessment capabilities. The source code will be available at https://github.com/zht8506/UniQA.

6/4/2024

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang

As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting the universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs with both visual perception and language ability for IAA and establish a low-cost paradigm for transforming the existing datasets into unified and high-quality visual instruction tuning data, from which the UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches the junior-level human. We find MLLMs have great potential in IAA, yet there remains plenty of room for further improvement. The UNIAA-LLaVA and UNIAA-Bench will be released.

4/16/2024

Cross-IQA: Unsupervised Learning for Image Quality Assessment

Zhen Zhang

Automatic perception of image quality is a challenging problem that impacts billions of Internet and social media users daily. To advance research in this field, we propose a no-reference image quality assessment (NR-IQA) method termed Cross-IQA based on vision transformer (ViT) model. The proposed Cross-IQA method can learn image quality features from unlabeled image data. We construct the pretext task of synthesized image reconstruction to unsupervised extract the image quality information based ViT block. The pretrained encoder of Cross-IQA is used to fine-tune a linear regression model for score prediction. Experimental results show that Cross-IQA can achieve state-of-the-art performance in assessing the low-frequency degradation information (e.g., color change, blurring, etc.) of images compared with the classical full-reference IQA and NR-IQA under the same datasets.

5/8/2024

CLIP-Guided Attribute Aware Pretraining for Generalizable Image Quality Assessment

Daekyu Kwon, Dongyoung Kim, Sehwan Ki, Younghyun Jo, Hyong-Euk Lee, Seon Joo Kim

In no-reference image quality assessment (NR-IQA), the challenge of limited dataset sizes hampers the development of robust and generalizable models. Conventional methods address this issue by utilizing large datasets to extract rich representations for IQA. Also, some approaches propose vision language models (VLM) based IQA, but the domain gap between generic VLM and IQA constrains their scalability. In this work, we propose a novel pretraining framework that constructs a generalizable representation for IQA by selectively extracting quality-related knowledge from VLM and leveraging the scalability of large datasets. Specifically, we carefully select optimal text prompts for five representative image quality attributes and use VLM to generate pseudo-labels. Numerous attribute-aware pseudo-labels can be generated with large image datasets, allowing our IQA model to learn rich representations about image quality. Our approach achieves state-of-the-art performance on multiple IQA datasets and exhibits remarkable generalization capabilities. Leveraging these strengths, we propose several applications, such as evaluating image generation models and training image enhancement models, demonstrating our model's real-world applicability. We will make the code available for access.

6/4/2024