AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation

2403.13352

Published 4/4/2024 by Jingkun An, Yinghao Zhu, Zongjian Li, Haoran Feng, Bohua Chen, Yemin Shi, Chengwei Pan

Abstract

Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation. Despite their progress, challenges remain in both prompt-following ability, image quality and lack of high-quality datasets, which are essential for refining these models. As acquiring labeled data is costly, we introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optimization (DPO) in a fully AI-driven approach. AGFSync utilizes Vision-Language Models (VLM) to assess image quality across style, coherence, and aesthetics, generating feedback data within an AI-driven loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and SDXL, our extensive experiments on the TIFA dataset demonstrate notable improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2 benchmark, consistently outperforming the base models. AGFSync's method of refining T2I diffusion models paves the way for scalable alignment techniques.

Overview

The paper introduces AGFSync, a framework that leverages AI-generated feedback to optimize text-to-image generation models.
The approach fine-tunes the model with Direct Preference Optimization (DPO) on preference data, improving prompt following and image quality without costly human-labeled data.
The feedback comes from Vision-Language Models (VLMs) that assess generated images for style, coherence, and aesthetics; this feedback is then used to fine-tune the text-to-image generation model.

Plain English Explanation

The paper describes a new way to create images from text descriptions. Typically, these text-to-image models are trained on a large dataset of image-text pairs, and then use that training to generate new images based on text prompts. However, the resulting images may not always match what the user had in mind.

The AGFSync approach tries to solve this by having the model get "feedback" from another AI system. This feedback AI looks at the generated images and assesses how well they match the original text prompt and how good they look. That feedback is then used to fine-tune the text-to-image model, so that the images it produces are better aligned with what the prompt asks for.

The key idea is to train the model on preferences expressed by AI-generated feedback rather than on human-labeled data, which is costly to collect. This lets the model learn to generate images that are a closer fit to what the user wants, even where its original training fell short.

Technical Explanation

The paper introduces a framework called AGFSync that uses AI-generated feedback to align text-to-image generation models through Direct Preference Optimization (DPO). The core idea is to use Vision-Language Models (VLMs) as a feedback source that assesses generated images against the original text prompt across style, coherence, and aesthetics. This feedback is then used to fine-tune the text-to-image generation model, allowing it to learn to produce images that follow prompts more closely and score better on quality.

The system consists of three main components:

A text-to-image generation model, such as a diffusion model, that produces images from text prompts.
A feedback model (in AGFSync, Vision-Language Models) that takes the generated images and text prompts as input, and outputs an assessment of how well each image matches the prompt and how good it looks.
An optimization procedure (in AGFSync, DPO) that uses the feedback to fine-tune the parameters of the text-to-image generation model.
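
The three components above can be sketched as a small preference-pair builder. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_candidates` and `score_image` are hypothetical stand-ins for a diffusion model and a VLM-based scorer, the three criteria names are taken from the abstract, and both the equal weighting of criteria and the best-versus-worst pairing rule are assumptions made for the example.

```python
import random
from dataclasses import dataclass

# Criteria named in the abstract; everything else here is illustrative.
CRITERIA = ("style", "coherence", "aesthetics")


@dataclass
class PreferencePair:
    prompt: str
    preferred: str  # candidate with the highest feedback score
    rejected: str   # candidate with the lowest feedback score


def generate_candidates(prompt, n):
    """Hypothetical stand-in for a T2I model: returns n image identifiers."""
    return [f"{prompt}#img{i}" for i in range(n)]


def score_image(prompt, image, rng):
    """Hypothetical stand-in for a VLM scorer: one score in [0, 1) per criterion."""
    return {criterion: rng.random() for criterion in CRITERIA}


def build_pair(prompt, n=4, seed=0):
    """Generate n candidates, score each, and pair the best with the worst."""
    rng = random.Random(seed)
    candidates = generate_candidates(prompt, n)
    totals = {}
    for image in candidates:
        scores = score_image(prompt, image, rng)
        # Assumption: criteria are averaged with equal weight.
        totals[image] = sum(scores.values()) / len(CRITERIA)
    best = max(candidates, key=totals.get)
    worst = min(candidates, key=totals.get)
    return PreferencePair(prompt, best, worst), totals


pair, totals = build_pair("a red cube on a wooden table")
print(pair.preferred, "is preferred over", pair.rejected)
```

Pairs built this way are the kind of input a preference-optimization step consumes: each one says, for a given prompt, which of two images the feedback favored.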

According to the abstract, the feedback model (in AGFSync, Vision-Language Models) assesses generated images and produces feedback data within a fully AI-driven loop, with no human labeling. The text-to-image generation model is then fine-tuned on this feedback data with DPO, adjusting its parameters toward the images the feedback favors.

The key innovation is that the preference signal comes entirely from AI-generated feedback rather than from human annotation. This is what makes the approach scalable, and it lets the model learn to generate images that better match user preferences, even if those preferences are not fully captured in the training data.
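
To make the preference-optimization step concrete, here is a minimal sketch of the standard DPO loss for a single (preferred, rejected) pair. It is not the paper's training code: it assumes log-probabilities of each image under the model being trained and under a frozen reference model are available as plain numbers, whereas diffusion variants of DPO typically have to approximate these terms, which this sketch does not attempt. The parameter names and the default `beta` are illustrative.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: log-probability of the preferred / rejected sample
    under the model being trained (assumed to be given).
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    beta: strength of the pull toward the reference model (illustrative default).
    """
    # How much more the trained model favors the preferred sample than the reference does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# When the trained model equals the reference, the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Raising the preferred sample's log-probability lowers the loss.
print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

The loss falls as the trained model assigns relatively more probability to the preferred image than the reference model does, which is how feedback-derived pairs steer the generator without an explicit reward model.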

Critical Analysis

The paper presents a promising approach for improving text-to-image generation by incorporating AI-generated feedback to directly optimize the model's preferences. This addresses a key limitation of existing text-to-image systems, which can struggle to produce images that fully align with user intent.

That said, the evaluation described in the abstract relies on automated metrics: VQA scores on the TIFA dataset, aesthetic evaluations, and the HPSv2 benchmark. It's unclear how well the approach would scale to more complex, open-ended text prompts and diverse user preferences.

Additionally, the reliance on a separate feedback model adds complexity and potential failure modes. Because the generator is trained on the feedback, any bias or error in the feedback signal can carry over into the fine-tuned model, so the quality and reliability of that signal could be a critical factor in the effectiveness of the overall system.

Further research would be needed to understand the practical tradeoffs and constraints of deploying AGFSync in production settings. Potential areas for exploration include the sensitivity of the approach to feedback model accuracy, the scalability to large-scale text-to-image generation, and the generalization to diverse user preferences beyond the training distribution.

Conclusion

The AGFSync framework presented in this paper represents an intriguing step towards improving text-to-image generation by directly optimizing model preferences using AI-generated feedback. By incorporating this feedback signal, the approach aims to generate images that better align with user intent, going beyond the limitations of standard text-to-image models.

While the paper reports improvements over the base models (SD v1.4, v1.5, and SDXL) on TIFA VQA scores, aesthetic evaluations, and the HPSv2 benchmark, further research would be needed to fully assess the real-world potential and constraints of this approach. Nonetheless, the core idea of leveraging AI-generated feedback to fine-tune generation models is a compelling direction that could have significant implications for the development of more powerful and user-centric text-to-image systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback

Sanghyeon Na, Yonggyu Kim, Hyunjoon Lee

The generation of high-quality human images through text-to-image (T2I) methods is a significant yet challenging task. Distinct from general image generation, human image synthesis must satisfy stringent criteria related to human pose, anatomy, and alignment with textual prompts, making it particularly difficult to achieve realistic results. Recent advancements in T2I generation based on diffusion models have shown promise, yet challenges remain in meeting human-specific preferences. In this paper, we introduce a novel approach tailored specifically for human image generation utilizing Direct Preference Optimization (DPO). Specifically, we introduce an efficient method for constructing a specialized DPO dataset for training human image generation models without the need for costly human feedback. We also propose a modified loss function that enhances the DPO training process by minimizing artifacts and improving image fidelity. Our method demonstrates its versatility and effectiveness in generating human images, including personalized text-to-image generation. Through comprehensive evaluations, we show that our approach significantly advances the state of human image generation, achieving superior results in terms of natural anatomies, poses, and text-image alignment.

5/31/2024

cs.CV cs.AI cs.LG

A Dense Reward View on Aligning Text-to-Image Diffusion with Preference

Shentao Yang, Tianqi Chen, Mingyuan Zhou

Aligning text-to-image diffusion model (T2I) with preference has been gaining increasing research attention. While prior works exist on directly optimizing T2I by preference data, these methods are developed under the bandit assumption of a latent reward on the entire diffusion reverse chain, while ignoring the sequential nature of the generation process. This may harm the efficacy and efficiency of preference alignment. In this paper, we take on a finer dense reward perspective and derive a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain. In particular, we introduce temporal discounting into DPO-style explicit-reward-free objectives, to break the temporal symmetry therein and suit the T2I generation hierarchy. In experiments on single and multiple prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. Further investigations are conducted to illustrate the insight of our approach.

5/14/2024

cs.CV

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji

Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research. For example, the Civitai community, a platform for T2I innovation, currently hosts an impressive array of 74,492 distinct models. However, this diversity presents a formidable challenge in selecting the most appropriate model and parameters, a process that typically requires numerous trials. Drawing inspiration from the tool usage research of large language models (LLMs), we introduce DiffAgent, an LLM agent designed to screen the accurate selection in seconds via API calls. DiffAgent leverages a novel two-stage training framework, SFTA, enabling it to accurately align T2I API responses with user input in accordance with human preferences. To train and evaluate DiffAgent's capabilities, we present DABench, a comprehensive dataset encompassing an extensive range of T2I APIs from the community. Our evaluations reveal that DiffAgent not only excels in identifying the appropriate T2I API but also underscores the effectiveness of the SFTA training framework. Codes are available at https://github.com/OpenGVLab/DiffAgent.

4/3/2024

cs.CL cs.AI

Improving GFlowNets for Text-to-Image Diffusion Alignment

Dinghuai Zhang, Yizhe Zhang, Jiatao Gu, Ruixiang Zhang, Josh Susskind, Navdeep Jaitly, Shuangfei Zhai

Diffusion models have become the de-facto approach for generating visual data, which are trained to match the distribution of the training dataset. In addition, we also want to control generation to fulfill desired properties such as alignment to a text description, which can be specified with a black-box reward function. Prior works fine-tune pretrained diffusion models to achieve this goal through reinforcement learning-based algorithms. Nonetheless, they suffer from issues including slow credit assignment as well as low quality in their generated samples. In this work, we explore techniques that do not directly maximize the reward but rather generate high-reward images with relatively high probability -- a natural scenario for the framework of generative flow networks (GFlowNets). To this end, we propose the Diffusion Alignment with GFlowNet (DAG) algorithm to post-train diffusion models with black-box property functions. Extensive experiments on Stable Diffusion and various reward specifications corroborate that our method could effectively align large-scale text-to-image diffusion models with given reward information.

6/18/2024

cs.LG cs.AI cs.CV stat.ML