Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)

2404.04251

Published 5/24/2024 by Michael Saxon, Fatima Jahara, Mahsa Khoshnoodi, Yujie Lu, Aditya Sharma, William Yang Wang

Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)

Abstract

With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness-the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked, instead presented with correlation to human Likert scores over a set of easy-to-discriminate images against seemingly weak baselines. We introduce T2IScoreScore (TS2), a curated set of semantic error graphs containing a prompt and a set increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore, particularly on a hard subset of naturally-occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.

Create account to get full access

Overview

The paper proposes a framework for objectively evaluating text-to-image prompt coherence metrics, which measure how well a text prompt aligns with the generated image.
The authors develop a new metric called (\gradientRGBTS268,67,14761,130,217) that aims to provide a more reliable and unbiased assessment of prompt coherence.
The paper compares (\gradientRGBTS268,67,14761,130,217) to existing metrics and shows that it better captures human judgments of prompt-image alignment.

Plain English Explanation

When you give a text prompt to an AI system that generates images, you want the generated image to match the prompt as closely as possible. The degree to which the image matches the prompt is called the "prompt coherence." Measuring prompt coherence is important for evaluating and improving these AI image generation systems.

The authors of this paper noticed that existing metrics for measuring prompt coherence have some issues - they may not always align with how humans judge the coherence between a prompt and its generated image. To address this, the researchers developed a new metric called (\gradientRGBTS268,67,14761,130,217) that aims to provide a more objective and reliable way to assess prompt coherence.

The key idea behind (\gradientRGBTS268,67,14761,130,217) is to capture the visual and semantic relationships between the prompt and the generated image, rather than just looking at surface-level features. The authors show that (\gradientRGBTS268,67,14761,130,217) is better able to match human judgments of prompt-image coherence compared to existing metrics.

Technical Explanation

The paper first reviews existing metrics for text-to-image prompt coherence, such as CLIP, ALIGN, and BLIP. It identifies several limitations of these metrics, including their inability to capture more nuanced semantic and visual relationships between prompts and images.

To address these issues, the authors propose a new metric called (\gradientRGBTS268,67,14761,130,217). This metric combines visual and semantic information by leveraging neural network models trained on large-scale image-text data. The key insight is to not just look at surface-level features, but to capture deeper relationships between the prompt and the generated image.

The authors conduct extensive experiments to evaluate (\gradientRGBTS268,67,14761,130,217) against existing metrics. They show that (\gradientRGBTS268,67,14761,130,217) better aligns with human judgments of prompt-image coherence, as measured through crowd-sourced evaluations. They also demonstrate the reliability and confidence-awareness of (\gradientRGBTS268,67,14761,130,217) through further analysis.

Critical Analysis

The paper presents a well-designed and thorough evaluation of the proposed (\gradientRGBTS268,67,14761,130,217) metric. The authors acknowledge the limitations of existing prompt coherence metrics and make a compelling case for the need for a more robust and reliable measure.

One potential concern is the complexity of the (\gradientRGBTS268,67,14761,130,217) metric, which may make it challenging to interpret and apply in practice. The authors do provide detailed explanations, but further efforts to simplify the metric or provide intuitive visualizations could improve its accessibility.

Additionally, the paper focuses on a specific task and dataset, and it would be valuable to see how (\gradientRGBTS268,67,14761,130,217) performs on a broader range of text-to-image generation scenarios. Extending the evaluation to different domains, languages, or generation models could further strengthen the claims and demonstrate the generalizability of the approach.

Conclusion

This paper presents a novel metric, (\gradientRGBTS268,67,14761,130,217), for objectively evaluating text-to-image prompt coherence. By capturing both visual and semantic relationships between prompts and generated images, (\gradientRGBTS268,67,14761,130,217) provides a more reliable and unbiased assessment compared to existing metrics.

The results demonstrate the effectiveness of (\gradientRGBTS268,67,14761,130,217) in aligning with human judgments of prompt-image coherence, which has important implications for the development and evaluation of AI-powered text-to-image generation systems. As these systems become more sophisticated and widely deployed, robust and trustworthy evaluation metrics will be crucial for ensuring their reliability and safety.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🧠

Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kaji'c, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Chris Knutsen, Cyrus Rashtchian, Jordi Pont-Tuset, Aida Nematzadeh

While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.

4/26/2024

cs.CV

📈

How to Evaluate Semantic Communications for Images with ViTScore Metric?

Tingting Zhu, Bo Peng, Jifan Liang, Tingchen Han, Hai Wan, Jingqiao Fu, Junjie Chen

Semantic communications (SC) have been expected to be a new paradigm shifting to catalyze the next generation communication, whose main concerns shift from accurate bit transmission to effective semantic information exchange in communications. However, the previous and widely-used metrics for images are not applicable to evaluate the image semantic similarity in SC. Classical metrics to measure the similarity between two images usually rely on the pixel level or the structural level, such as the PSNR and the MS-SSIM. Straightforwardly using some tailored metrics based on deep-learning methods in CV community, such as the LPIPS, is infeasible for SC. To tackle this, inspired by BERTScore in NLP community, we propose a novel metric for evaluating image semantic similarity, named Vision Transformer Score (ViTScore). We prove theoretically that ViTScore has 3 important properties, including symmetry, boundedness, and normalization, which make ViTScore convenient and intuitive for image measurement. To evaluate the performance of ViTScore, we compare ViTScore with 3 typical metrics (PSNR, MS-SSIM, and LPIPS) through 4 classes of experiments: (i) correlation with BERTScore through evaluation of image caption downstream CV task, (ii) evaluation in classical image communications, (iii) evaluation in image semantic communication systems, and (iv) evaluation in image semantic communication systems with semantic attack. Experimental results demonstrate that ViTScore is robust and efficient in evaluating the semantic similarity of images. Particularly, ViTScore outperforms the other 3 typical metrics in evaluating the image semantic changes by semantic attack, such as image inverse with Generative Adversarial Networks (GANs). This indicates that ViTScore is an effective performance metric when deployed in SC scenarios.

4/23/2024

cs.CV cs.AI cs.IT

🛸

Evaluating Text-to-Visual Generation with Image-to-Text Generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan

Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a bag of words, conflating prompts such as the horse is eating the grass with the grass is eating the horse. To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a Yes answer to a simple Does this figure show '{text}'? question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.

6/19/2024

cs.CV cs.AI cs.CL cs.LG cs.MM

Position: Towards Implicit Prompt For Text-To-Image Models

Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo

Recent text-to-image (T2I) models have had great success, and many benchmarks have been proposed to evaluate their performance and safety. However, they only consider explicit prompts while neglecting implicit prompts (hint at a target without explicitly mentioning it). These prompts may get rid of safety constraints and pose potential threats to the applications of these models. This position paper highlights the current state of T2I models toward implicit prompts. We present a benchmark named ImplicitBench and conduct an investigation on the performance and impacts of implicit prompts with popular T2I models. Specifically, we design and collect more than 2,000 implicit prompts of three aspects: General Symbols, Celebrity Privacy, and Not-Safe-For-Work (NSFW) Issues, and evaluate six well-known T2I models' capabilities under these implicit prompts. Experiment results show that (1) T2I models are able to accurately create various target symbols indicated by implicit prompts; (2) Implicit prompts bring potential risks of privacy leakage for T2I models. (3) Constraints of NSFW in most of the evaluated T2I models can be bypassed with implicit prompts. We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.

5/29/2024

cs.CY cs.AI cs.CV