Visual Verity in AI-Generated Imagery: Computational Metrics and Human-Centric Analysis

Read original: arXiv:2408.12762 - Published 9/4/2024 by Memoona Aziz, Umair Rehman, Syed Ali Safi, Amir Zaib Abbasi

🌐

Overview

The paper explores the evaluation of AI-generated images compared to camera-generated images.
The researchers introduced and validated a new questionnaire called Visual Verity to measure photorealism, image quality, and text-image alignment.
They applied the questionnaire to assess images from AI models (DALL-E2, DALL-E3, GLIDE, Stable Diffusion) and camera-generated images.
The paper also analyzed statistical properties and evaluated computational metrics' alignment with human judgments.

Plain English Explanation

The rapid advancements in AI technologies have revolutionized the production of graphical content across various sectors, such as entertainment, advertising, and e-commerce. As a result, there is a growing need for robust methods to evaluate the quality and realism of AI-generated images.

In this research, the authors introduced a new questionnaire called Visual Verity, which measures three key aspects of images: photorealism, image quality, and text-image alignment. They then used this questionnaire to assess images created by AI models (DALL-E2, DALL-E3, GLIDE, Stable Diffusion) and images captured by cameras.

The results showed that camera-generated images excelled in photorealism and text-image alignment, while the AI models produced images with higher overall quality. The researchers also found that camera-generated images scored lower in hue, saturation, and brightness compared to the AI-generated images.

Additionally, the study evaluated different computational metrics to see how well they aligned with the human judgments made using the Visual Verity questionnaire. The researchers identified MS-SSIM and CLIP as the most consistent with human assessments and proposed a new metric called the Neural Feature Similarity Score (NFSS) for assessing image quality.

The findings highlight the need to refine computational metrics to better capture human visual perception, which will help improve the evaluation of AI-generated content.

Technical Explanation

The researchers conducted three main studies to address the need for robust evaluation methods for AI-generated images.

First, they introduced and validated the Visual Verity questionnaire, which measures three key aspects of images: photorealism, image quality, and text-image alignment. This questionnaire was designed to provide a comprehensive assessment of AI-generated images.

Second, the researchers applied the Visual Verity questionnaire to images created by AI models (DALL-E2, DALL-E3, GLIDE, Stable Diffusion) and camera-generated images. Their analysis revealed that camera-generated images excelled in photorealism and text-image alignment, while the AI models produced images with higher overall quality. They also found that camera-generated images scored lower in hue, saturation, and brightness compared to the AI-generated images.

Third, the study evaluated the alignment between computational metrics and human judgments made using the Visual Verity questionnaire. The researchers identified MS-SSIM and CLIP as the most consistent with human assessments and proposed a new metric called the Neural Feature Similarity Score (NFSS) for assessing image quality.

Critical Analysis

The research presented in this paper makes important contributions to the field of AI-generated image evaluation. By introducing the Visual Verity questionnaire, the researchers have provided a comprehensive tool for assessing the quality and realism of AI-generated images.

However, the paper does not discuss potential limitations or caveats of the Visual Verity questionnaire. For example, it would be beneficial to understand how the questionnaire performs in evaluating images from a broader range of AI models or in different domains beyond the ones explored in the study.

Additionally, the analysis of computational metrics could be expanded to include a more in-depth discussion of the strengths and weaknesses of each metric, as well as their potential applications and limitations.

Further research could also investigate the relationship between the statistical properties of images (e.g., hue, saturation, brightness) and human perceptions of photorealism and quality. This could help refine the computational metrics and improve their alignment with human assessments.

Conclusion

This research highlights the need for robust evaluation methods to assess the quality and realism of AI-generated images. By introducing the Visual Verity questionnaire and analyzing the alignment between computational metrics and human judgments, the authors have made valuable contributions to the field.

The findings suggest that while AI models can produce images with high overall quality, they still face challenges in matching the photorealism and text-image alignment of camera-generated images. Continued refinement of computational metrics and a deeper understanding of human visual perception will be crucial for enhancing the evaluation and development of AI-generated content.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🌐

Visual Verity in AI-Generated Imagery: Computational Metrics and Human-Centric Analysis

Memoona Aziz, Umair Rehman, Syed Ali Safi, Amir Zaib Abbasi

The rapid advancements in AI technologies have revolutionized the production of graphical content across various sectors, including entertainment, advertising, and e-commerce. These developments have spurred the need for robust evaluation methods to assess the quality and realism of AI-generated images. To address this, we conducted three studies. First, we introduced and validated a questionnaire called Visual Verity, which measures photorealism, image quality, and text-image alignment. Second, we applied this questionnaire to assess images from AI models (DALL-E2, DALL-E3, GLIDE, Stable Diffusion) and camera-generated images, revealing that camera-generated images excelled in photorealism and text-image alignment, while AI models led in image quality. We also analyzed statistical properties, finding that camera-generated images scored lower in hue, saturation, and brightness. Third, we evaluated computational metrics' alignment with human judgments, identifying MS-SSIM and CLIP as the most consistent with human assessments. Additionally, we proposed the Neural Feature Similarity Score (NFSS) for assessing image quality. Our findings highlight the need for refining computational metrics to better capture human visual perception, thereby enhancing AI-generated content evaluation.

9/4/2024

🛸

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, Deva Ramanan

While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.

6/26/2024

🔍

How to Distinguish AI-Generated Images from Authentic Photographs

Negar Kamali, Karyn Nakamura, Angelos Chatzimparmpas, Jessica Hullman, Matthew Groh

The high level of photorealism in state-of-the-art diffusion models like Midjourney, Stable Diffusion, and Firefly makes it difficult for untrained humans to distinguish between real photographs and AI-generated images. To address this problem, we designed a guide to help readers develop a more critical eye toward identifying artifacts, inconsistencies, and implausibilities that often appear in AI-generated images. The guide is organized into five categories of artifacts and implausibilities: anatomical, stylistic, functional, violations of physics, and sociocultural. For this guide, we generated 138 images with diffusion models, curated 9 images from social media, and curated 42 real photographs. These images showcase the kinds of cues that prompt suspicion towards the possibility an image is AI-generated and why it is often difficult to draw conclusions about an image's provenance without any context beyond the pixels in an image. Human-perceptible artifacts are not always present in AI-generated images, but this guide reveals artifacts and implausibilities that often emerge. By drawing attention to these kinds of artifacts and implausibilities, we aim to better equip people to distinguish AI-generated images from real photographs in the future.

6/14/2024

Evaluating Text-to-Image Synthesis: Survey and Taxonomy of Image Quality Metrics

Sebastian Hartwig, Dominik Engel, Leon Sick, Hannah Kniesel, Tristan Payer, Poonam Poonam, Michael Glockler, Alex Bauerle, Timo Ropinski

Recent AI-based text-to-image models not only excel at generating realistic images, they also give designers more and more fine-grained control over the image content. Consequently, these approaches have gathered increased attention within the computer graphics research community, which has been historically devoted towards traditional rendering techniques that offer precise control over scene parameters such as objects, materials, and lighting, when generating realistic images. While the quality of rendered images is traditionally assessed through well-established image quality metrics, such as SSIM or PSNR, the unique challenges presented by text-to-image models, which in contrast to rendering interweave the control of scene and rendering parameters, necessitate the development of novel image quality metrics. Therefore, within this survey, we provide a comprehensive overview of existing text-to-image quality metrics addressing their nuances and the need for alignment with human preferences. Based on our findings, we propose a new taxonomy for categorizing these metrics, which is grounded in the assumption that there are two main quality criteria, namely compositionality and generality, which ideally map to human preferences. Ultimately, we derive guidelines for practitioners conducting text-to-image evaluation, discuss open challenges of evaluation mechanisms, and surface limitations of current metrics.

7/24/2024