GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

arXiv:2406.13743

Published 6/26/2024 by Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig and 1 other

cs.CV cs.AI cs.CL cs.LG cs.MM

Abstract

While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.

Overview

Current text-to-visual models can produce photo-realistic images and videos, but struggle with compositional prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison.
This research conducted an extensive human study on GenAI-Bench to evaluate leading image and video generation models on various aspects of compositional text-to-visual generation.
The study also compared automated evaluation metrics against the collected human ratings, finding that VQAScore significantly outperformed previous metrics such as CLIPScore.
VQAScore can also improve generation in a black-box manner (without fine-tuning) by ranking a few (3 to 9) candidate images, which is 2x to 3x more effective at improving human alignment ratings than ranking with PickScore, HPSv2, or ImageReward.
The researchers will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt.

Plain English Explanation

Current artificial intelligence (AI) models can create very realistic images and videos from text instructions. However, they struggle with compositional prompts: instructions that combine attributes, relationships between objects, and higher-level reasoning such as logic and comparison.

This research project conducted a large-scale study to test how well leading image and video generation models perform on these types of complex, compositional text-to-visual tasks. The researchers also compared different automated metrics that can be used to evaluate the quality of the generated images and videos. They found that a metric called VQAScore significantly outperformed other evaluation methods.

Interestingly, the researchers found that VQAScore can also improve generated images without any further training of the underlying AI models. Simply ranking a few (3 to 9) candidate images by their VQAScore was 2x to 3x more effective at improving human alignment ratings than ranking with other scoring methods.

To help advance research in this area, the researchers will release a new GenAI-Rank benchmark with over 40,000 human ratings, designed to test how well scoring metrics rank images generated from the same prompt. They also plan to release all of their human ratings (over 80,000) for benchmarking both generative models and automated metrics.

Technical Explanation

The paper focuses on the challenge of compositional text-to-visual generation, where AI models must translate complex text prompts involving attributes, relationships, and higher-order reasoning into corresponding visual outputs.

To evaluate the performance of leading image and video generation models on these tasks, the researchers conducted a large human study using the GenAI-Bench benchmark. Participants were asked to rate the alignment between text prompts and the generated images/videos across a variety of compositional dimensions.

The study also compared several automated evaluation metrics against the collected human ratings. One of them, VQAScore, measures the likelihood that a visual question-answering (VQA) model views an image as accurately depicting the input prompt. The researchers found that VQAScore significantly outperformed previous metrics such as CLIPScore in this comparison.
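
As a point of reference, the VQAScore abstract listed under Related Papers describes the score as the probability of a "Yes" answer to a simple templated question about the image. Written as a formula (the notation below is an illustrative assumption, not taken from the paper):

```latex
\[
\mathrm{VQAScore}(\text{image}, \text{text})
  = P\bigl(\text{``Yes''} \;\big|\; \text{image},\ \text{``Does this figure show `\{text\}'?''}\bigr)
\]
```

Here `{text}` stands for the prompt being evaluated, so a higher probability of "Yes" means the VQA model judges the image to match the prompt more closely.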

Importantly, the paper shows that VQAScore can improve generation in a black-box manner, without fine-tuning the underlying models. Ranking a small set of candidate images (3 to 9) by VQAScore was 2x to 3x more effective at improving human alignment ratings for DALL-E 3 and Stable Diffusion than ranking with other scoring methods such as PickScore, HPSv2, and ImageReward, especially on compositional prompts that require advanced visio-linguistic reasoning.
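
The ranking step itself is simple enough to sketch. The example below is a minimal illustration, not the authors' implementation: `rank_candidates`, `pick_best`, and `placeholder_score` are hypothetical names, and the scorer is a lookup table of invented numbers standing in for a real alignment metric such as VQAScore.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer signature: (image identifier, prompt) -> alignment score.
# A real pipeline would call an actual metric here; this is an assumption.
ScoreFn = Callable[[str, str], float]


def rank_candidates(prompt: str, candidates: Sequence[str], score_fn: ScoreFn) -> List[str]:
    """Return the candidate images ordered from highest to lowest score."""
    # Score each candidate exactly once, since a real scorer may be expensive.
    scored = [(score_fn(image, prompt), image) for image in candidates]
    # list.sort() is stable, so tied candidates keep their generation order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image for _, image in scored]


def pick_best(prompt: str, candidates: Sequence[str], score_fn: ScoreFn) -> str:
    """Return the single highest-scoring candidate image."""
    if not candidates:
        raise ValueError("need at least one candidate image")
    return rank_candidates(prompt, candidates, score_fn)[0]


# Placeholder scores, invented for illustration only (not results from the paper).
PLACEHOLDER_SCORES = {
    "candidate_0.png": 0.31,
    "candidate_1.png": 0.87,
    "candidate_2.png": 0.55,
}


def placeholder_score(image: str, prompt: str) -> float:
    """Stand-in for a real metric: looks up a made-up score and ignores the prompt."""
    return PLACEHOLDER_SCORES[image]


best = pick_best(
    "a red cube on top of a blue sphere",
    list(PLACEHOLDER_SCORES),
    placeholder_score,
)
print(best)  # candidate_1.png
```

Because the generator is only asked for several candidates and never modified, this kind of selection works with any model that can be sampled more than once, which is what makes the approach black-box.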

To facilitate further research, the researchers will release the GenAI-Rank benchmark, which includes over 40,000 human ratings for evaluating scoring metrics on ranking images generated from the same prompt, along with all collected human ratings (over 80,000) for benchmarking both generative models and automated metrics.

Critical Analysis

While the paper makes a strong case for the effectiveness of VQAScore as an evaluation metric and generation guidance tool, the researchers acknowledge that it still has room for improvement, particularly in capturing fine-grained visual details.

Additionally, the study focused on a limited set of generation models and prompts, so it remains to be seen how well the findings would generalize to a broader range of AI systems and text-to-visual tasks. The researchers also note that their human study, while extensive, may still be subject to biases and limitations inherent in crowdsourced evaluations.

Future research could explore ways to further refine VQAScore and other automated metrics to better align with human judgments, particularly on more complex, multi-faceted prompts. Investigating the underlying reasons for the performance gaps between different models and metrics could also yield valuable insights to guide the development of more capable text-to-visual generation systems.

Conclusion

This research represents an important step forward in understanding the capabilities and limitations of current text-to-visual generation models, as well as the effectiveness of automated evaluation metrics. The finding that VQAScore significantly outperforms previous methods, and can be used to improve generation quality in a practical, black-box manner, is a promising development.

The release of the GenAI-Rank benchmark will provide a valuable resource for the research community to further explore these issues and advance the state of the art in compositional text-to-visual generation. Continued progress in this area could lead to more powerful and versatile AI-generated content that can better align with human understanding and reasoning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evaluating Text-to-Visual Generation with Image-to-Text Generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan

Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a bag of words, conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.

6/19/2024

cs.CV cs.AI cs.CL cs.LG cs.MM

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, William Yang Wang

Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.

6/14/2024

cs.CV cs.AI cs.CL

Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

Simon Ging, María A. Bravo, Thomas Brox

The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.

5/7/2024

cs.CV cs.CL cs.LG

Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation

Shengyuan Liu, Bo Wang, Ye Ma, Te Yang, Xipeng Cao, Quan Chen, Han Li, Di Dong, Peng Jiang

Existing subject-driven text-to-image generation models suffer from tedious fine-tuning steps and struggle to maintain both text-image alignment and subject fidelity. For generating compositional subjects, it often encounters problems such as object missing and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these limitations, we propose a subject-driven generation framework and introduce training-free guidance to intervene in the generative process during inference time. This approach strengthens the attention map, allowing for precise attribute binding and feature injection for each subject. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation. Furthermore, we propose a novel metric GroundingScore to evaluate subject alignment thoroughly. The obtained quantitative results serve as compelling evidence showcasing the effectiveness of our proposed method. The code will be released soon.

5/14/2024

cs.CV