UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Read original: arXiv:2404.09619 - Published 4/16/2024 by Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Overview

This paper proposes a unified multi-modal image aesthetic assessment baseline and benchmark called UNIAA.
UNIAA aims to provide a comprehensive framework for evaluating image aesthetics using both visual and textual modalities.
The authors develop a large-scale dataset and evaluation protocols to establish UNIAA as a robust and standardized benchmark for the field.

Plain English Explanation

The paper focuses on developing a new system for automatically assessing the aesthetic quality of images. Traditionally, these types of aesthetic assessment systems have relied solely on analyzing the visual properties of an image. However, the researchers behind UNIAA believe that incorporating language-based analysis can provide a more holistic and accurate evaluation.

UNIAA works by combining visual and textual information to judge an image's aesthetic appeal. For example, it might analyze the composition, colors, and lighting of an image, while also considering any captions, descriptions, or commentary associated with the image. By bringing together these different modalities, the researchers hope to create a more well-rounded and reliable system for assessing image aesthetics.

To support this new approach, the paper also introduces a large dataset of images and their associated text-based information. This dataset serves as a benchmark that can be used to train and evaluate different aesthetic assessment models. The goal is to establish UNIAA as a standardized resource for researchers and developers working on automatic image evaluation systems.

Technical Explanation

The key technical components of UNIAA include:

Multi-modal Dataset: The authors curate a large-scale dataset of images paired with textual descriptions, captions, and user comments. This dataset serves as a comprehensive benchmark for evaluating multi-modal image aesthetic assessment models.
Evaluation Protocols: The paper defines several evaluation protocols to assess the performance of aesthetic assessment models on different tasks, such as predicting image aesthetic scores, ranking images by aesthetic appeal, and classifying images as "aesthetic" or "non-aesthetic."
Baseline Models: The authors establish several baseline models for UNIAA, including visual-only and multi-modal approaches. These baselines provide a starting point for researchers to benchmark their own models against.
Instruct Tuning: The paper explores the use of instruct tuning, a technique for fine-tuning large language models to perform specific tasks. The authors demonstrate that instruct tuning can be effective for improving the performance of multi-modal image aesthetic assessment models.

Critical Analysis

While UNIAA represents a significant step forward in developing multi-modal aesthetic assessment systems, the paper acknowledges several limitations and areas for further research:

The dataset, while large, may not capture the full diversity of aesthetic judgments and preferences across different cultures and demographics. Expanding the dataset to be more inclusive could enhance the generalizability of the UNIAA benchmark.
The baseline models presented in the paper, while providing a useful starting point, may not fully leverage the potential of multi-modal approaches. Exploring more advanced architectures and training techniques could lead to further improvements in aesthetic assessment performance.
The instruct tuning approach used in the paper is a relatively new and emerging technique. Further research is needed to understand its limitations and optimal application for multi-modal tasks like image aesthetic assessment.

Conclusion

The UNIAA framework proposed in this paper marks an important advancement in the field of automatic image aesthetic assessment. By combining visual and textual modalities, the researchers aim to create a more comprehensive and reliable system for evaluating the aesthetic qualities of images.

The introduction of the UNIAA dataset and evaluation protocols provides a valuable benchmark for the research community, enabling the development and comparison of various multi-modal aesthetic assessment models. Moreover, the exploration of instruct tuning suggests promising avenues for enhancing the performance of these models.

Overall, the UNIAA work represents a significant step towards bridging the gap between human-level aesthetic judgments and machine-based assessment, with potential applications in areas such as digital art, photography, and user-generated content curation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang

As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting the universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs with both visual perception and language ability for IAA and establish a low-cost paradigm for transforming the existing datasets into unified and high-quality visual instruction tuning data, from which the UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches the junior-level human. We find MLLMs have great potential in IAA, yet there remains plenty of room for further improvement. The UNIAA-LLaVA and UNIAA-Bench will be released.

4/16/2024

UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment

Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Runze Hu, Xiu Li

Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Existing methods typically address these tasks independently due to distinct learning objectives. However, they neglect the underlying interconnectedness of both tasks, which hinders the learning of task-agnostic shared representations for human subjective perception. To confront this challenge, we propose Unified vision-language pre-training of Quality and Aesthetics (UniQA), to learn general perceptions of two tasks, thereby benefiting them simultaneously. Addressing the absence of text in the IQA datasets and the presence of textual noise in the IAA datasets, (1) we utilize multimodal large language models (MLLMs) to generate high-quality text descriptions; (2) the generated text for IAA serves as metadata to purify noisy IAA data. To effectively adapt the pre-trained UniQA to downstream tasks, we further propose a lightweight adapter that utilizes versatile cues to fully exploit the extensive knowledge of the pre-trained model. Extensive experiments demonstrate that our approach attains a new state-of-the-art performance on both IQA and IAA tasks, while concurrently showcasing exceptional zero-shot and few-label image assessment capabilities. The source code will be available at https://github.com/zht8506/UniQA.

6/4/2024

Multi-modal Learnable Queries for Image Aesthetics Assessment

Zhiwei Xiong, Yunfan Zhang, Zhiqi Shen, Peiran Ren, Han Yu

Image aesthetics assessment (IAA) is attracting wide interest with the prevalence of social media. The problem is challenging due to its subjective and ambiguous nature. Instead of directly extracting aesthetic features solely from the image, user comments associated with an image could potentially provide complementary knowledge that is useful for IAA. With existing large-scale pre-trained models demonstrating strong capabilities in extracting high-quality transferable visual and textual features, learnable queries are shown to be effective in extracting useful features from the pre-trained visual features. Therefore, in this paper, we propose MMLQ, which utilizes multi-modal learnable queries to extract aesthetics-related features from multi-modal pre-trained features. Extensive experimental results demonstrate that MMLQ achieves new state-of-the-art performance on multi-modal IAA, beating previous methods by 7.7% and 8.3% in terms of SRCC and PLCC, respectively.

5/3/2024

💬

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Fanyi Wang, Yanchun Xie, Yi-Jie Huang, Yaqian Li

Recent advancements in multi-modal large language models (MLLMs) have led to substantial improvements in visual understanding, primarily driven by sophisticated modality alignment strategies. However, predominant approaches prioritize global or regional comprehension, with less focus on fine-grained, pixel-level tasks. To address this gap, we introduce u-LLaVA, an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs. We commence by leveraging an efficient modality alignment approach, harnessing both image and video datasets to bolster the model's foundational understanding across diverse visual contexts. Subsequently, a joint instruction tuning method with task-specific projectors and decoders for end-to-end downstream training is presented. Furthermore, this work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs. The overall framework is simple, effective, and achieves state-of-the-art performance across multiple benchmarks. We also make our model, data, and code publicly accessible at https://github.com/OPPOMKLab/u-LLaVA.

8/29/2024