EvalAlign: Evaluating Text-to-Image Models through Precision Alignment of Multimodal Large Models with Supervised Fine-Tuning to Human Annotations

2406.16562

Published 6/28/2024 by Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Mengping Yang, Cheng Zhang, Hao Li

EvalAlign: Evaluating Text-to-Image Models through Precision Alignment of Multimodal Large Models with Supervised Fine-Tuning to Human Annotations

Abstract

The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We Supervised Fine-Tune (SFT) the MLLM to align closely with human evaluative judgments, resulting in a robust evaluation model. Our comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.

Create account to get full access

Overview

• This paper introduces EvalAlign, a supervised fine-tuning approach to train multimodal large language models (LLMs) on human-aligned data for the purpose of evaluating text-to-image (T2I) models.

• The key idea is to leverage human judgments on image-text alignment to fine-tune LLMs, enabling them to better assess the quality of T2I model outputs.

Plain English Explanation

• Modern AI systems can generate images from text descriptions, but evaluating the quality of these generated images is challenging. EvalAlign proposes a solution to this problem.

• The researchers fine-tuned large language models (LLMs) - powerful AI models trained on vast amounts of text data - using human ratings of how well images match their text descriptions. This "human-aligned" fine-tuning helps the LLMs better understand and assess the relationship between text and images.

• With this fine-tuned LLM, the researchers can now evaluate the performance of text-to-image (T2I) models - AI systems that generate images from text. The fine-tuned LLM can provide more accurate and nuanced judgments on how well the T2I model's outputs match the original text prompts.

• This approach allows for more reliable and interpretable evaluation of T2I models, which is crucial as these models become more sophisticated and widely deployed. It bridges the gap between human judgments and automated model evaluation.

Technical Explanation

• The researchers fine-tuned the AlignGPT multimodal LLM using a novel human-aligned dataset, consisting of image-text pairs and corresponding human ratings of alignment quality.

• This fine-tuning process allows the LLM to learn the nuances of how humans judge the relationship between text and images, going beyond simple imitation to capture more fine-grained aspects of image-text correspondence.

• The researchers evaluate EvalAlign on several T2I benchmarks, including AlignMMBench, demonstrating its ability to provide more human-aligned multimodal evaluation compared to existing metrics.

Critical Analysis

• The paper acknowledges that the success of EvalAlign relies on the availability and quality of the human-aligned dataset used for fine-tuning. Collecting such datasets can be labor-intensive and may introduce biases.

• While EvalAlign shows promise for evaluating T2I models, further research is needed to understand its generalization to other multimodal tasks beyond image-text alignment.

• The paper does not provide extensive comparisons to alternative evaluation approaches, such as human evaluation or other automated metrics. More comprehensive benchmarking could strengthen the case for EvalAlign's advantages.

Conclusion

• EvalAlign offers a novel approach to leveraging human judgments on image-text alignment to fine-tune multimodal LLMs for more reliable and interpretable evaluation of T2I models.

• This work represents an important step towards bridging the gap between human and automated evaluation of multimodal AI systems, with potential implications for the responsible development and deployment of these technologies.

• Continued research on human-aligned multimodal evaluation, dataset creation, and the broader applicability of approaches like EvalAlign will be crucial as AI systems become more sophisticated and integrated into our lives.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai

Multimodal Large Language Models (MLLMs) are widely regarded as crucial in the exploration of Artificial General Intelligence (AGI). The core of MLLMs lies in their capability to achieve cross-modal alignment. To attain this goal, current MLLMs typically follow a two-phase training paradigm: the pre-training phase and the instruction-tuning phase. Despite their success, there are shortcomings in the modeling of alignment capabilities within these models. Firstly, during the pre-training phase, the model usually assumes that all image-text pairs are uniformly aligned, but in fact the degree of alignment between different image-text pairs is inconsistent. Secondly, the instructions currently used for finetuning incorporate a variety of tasks, different tasks's instructions usually require different levels of alignment capabilities, but previous MLLMs overlook these differentiated alignment needs. To tackle these issues, we propose a new multimodal large language model AlignGPT. In the pre-training stage, instead of treating all image-text pairs equally, we assign different levels of alignment capabilities to different image-text pairs. Then, in the instruction-tuning phase, we adaptively combine these different levels of alignment capabilities to meet the dynamic alignment needs of different instructions. Extensive experimental results show that our model achieves competitive performance on 12 benchmarks.

5/24/2024

cs.CL cs.AI cs.CV

🏋️

Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment

Geyang Guo, Ranchi Zhao, Tianyi Tang, Wayne Xin Zhao, Ji-Rong Wen

Alignment with human preference is a desired property of large language models (LLMs). Currently, the main alignment approach is based on reinforcement learning from human feedback (RLHF). Despite the effectiveness of RLHF, it is intricate to implement and train, thus recent studies explore how to develop alternative alignment approaches based on supervised fine-tuning (SFT). A major limitation of SFT is that it essentially does imitation learning, which cannot fully understand what are the expected behaviors. To address this issue, we propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained (i.e., token or phrase level) quality signals that are derived by contrasting good and bad responses. Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones. Secondly, we devise a new loss function can leverage fine-grained quality signals to instruct the learning of LLMs for alignment. Extensive experiments have demonstrated the effectiveness of our approaches by comparing a number of competitive baselines.

4/16/2024

cs.CL

🖼️

FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo

Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite the impressive ability to perform complex reasoning for VLMs, current models often struggle to effectively and precisely capture the compositional information on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the VLMs' compositionality for aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate the models' performance on this new task, we propose a new evaluation metric named ITM-IoU for which our experiments show a high correlation to human evaluation. In addition, we also provide a comprehensive experimental analysis of existing mainstream VLMs, including fully supervised learning and in-context learning settings. We have found that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models (e.g., GPT-4V, Gemini Pro Vision) with strong abilities to perform multimodal in-context learning are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction.

4/24/2024

cs.CV cs.CL

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

Yuhang Wu, Wenmeng Yu, Yean Cheng, Yan Wang, Xiaohan Zhang, Jiazheng Xu, Ming Ding, Yuxiao Dong

Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, a comprehensive alignment benchmark specifically designed for emerging Chinese VLMs. This benchmark is meticulously curated from real-world scenarios and Chinese Internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite strategy, AlignMMBench encompasses 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4's evaluation ability. Finally, we report the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. All evaluation codes and data are available on https://alignmmbench.github.io.

6/17/2024

cs.CL cs.CV