BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Read original: arXiv:2404.04544 - Published 4/9/2024 by Gwanghyun Kim, Hayeon Kim, Hoigi Seo, Dong Un Kang, Se Young Chun

Overview

This paper presents "BeyondScene", a framework for generating higher-resolution (over 8K), human-centric scenes from text.
The framework reuses existing pretrained diffusion models rather than training a new one, so it avoids costly retraining.
The authors report that it surpasses existing methods in correspondence with detailed text descriptions and in naturalness.

Plain English Explanation

"BeyondScene" is a new AI system that can create realistic, detailed images based on text descriptions. Unlike previous text-to-image models, BeyondScene is specifically designed to generate scenes that are centered around human figures.

The key innovation is that BeyondScene uses pre-trained "diffusion" models as the foundation. Diffusion models are trained by gradually adding "noise" to images and learning to reverse that process; to generate a new image, they start from noise and remove it step by step. By starting with these pre-trained diffusion models, BeyondScene is able to create detailed, high-resolution images that seamlessly integrate human subjects into the surrounding scene.
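
The noise-adding half of a diffusion model can be written in a few lines. The sketch below is a minimal illustration of the standard forward diffusion step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and is not code from the BeyondScene paper. The linear noise schedule, the 1000-step length, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def forward_diffuse(x0, t, alpha_bar, rng):
    """Jump straight from a clean image x0 to its noisy version at step t."""
    eps = rng.standard_normal(x0.shape)  # Gaussian noise, same shape as the image
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy stand-in for a normalized image

x_early, _ = forward_diffuse(x0, 10, alpha_bar, rng)   # still close to x0
x_late, _ = forward_diffuse(x0, 999, alpha_bar, rng)   # almost pure noise
```

A trained model learns to predict `eps` from `x_t`, which is what lets it run the process in reverse and turn noise into an image.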

The researchers show that BeyondScene can generate a wide variety of human-centric scenes, from everyday indoor and outdoor settings to more fantastical or imaginative environments. The generated images appear natural and coherent, with the human figures blending naturally into the overall scene.

Overall, BeyondScene represents an important advance in text-to-image generation, pushing the boundaries of what's possible by focusing on the human element and leveraging powerful pre-trained diffusion models. This could have applications in areas like visual storytelling, virtual worlds, and human-AI interaction.

Technical Explanation

The core of BeyondScene is a staged, hierarchical framework built on top of existing pretrained diffusion models such as Stable Diffusion. The pretrained models are used as they are: according to the paper's abstract, the framework goes beyond the capacity of those models without costly retraining. It first generates a detailed base image and then enlarges that image to a higher-resolution output.

Key innovations include:

A base-image stage that focuses on instance creation for multiple humans and on detailed descriptions beyond the diffusion model's token limit
An instance-aware hierarchical enlargement process that converts the base image into a higher-resolution output exceeding the training image size, adding details that are aware of the text and the instances
Two proposed components inside that enlargement process: high-frequency injected forward diffusion and adaptive joint diffusion
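
One common training-free way to produce an image larger than a diffusion model's training size is to denoise overlapping windows of a large canvas and average the results wherever windows overlap. The paper's abstract names "adaptive joint diffusion" as one of its components. The sketch below shows only the generic fixed-stride version of window averaging and should not be read as the authors' implementation. `denoise_fn` is a hypothetical stand-in for one denoising step of a pretrained model, and the window and stride sizes are arbitrary.

```python
import numpy as np


def window_starts(size, window, stride):
    """Start offsets of windows that cover [0, size); assumes size >= window."""
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)  # extra window so the far edge is covered
    return starts


def joint_denoise_step(canvas, window, stride, denoise_fn):
    """Apply denoise_fn to every window and average overlapping predictions."""
    h, w = canvas.shape
    acc = np.zeros_like(canvas)    # sum of per-window outputs
    count = np.zeros_like(canvas)  # how many windows touched each pixel
    for top in window_starts(h, window, stride):
        for left in window_starts(w, window, stride):
            patch = canvas[top:top + window, left:left + window]
            acc[top:top + window, left:left + window] += denoise_fn(patch)
            count[top:top + window, left:left + window] += 1.0
    return acc / count


def toy_denoise(patch):
    """Hypothetical placeholder: a real model would predict a cleaner patch."""
    return 0.9 * patch


rng = np.random.default_rng(0)
canvas = rng.standard_normal((96, 128))  # canvas larger than the 64x64 window
out = joint_denoise_step(canvas, window=64, stride=32, denoise_fn=toy_denoise)
```

Averaging in the overlap regions is what keeps neighbouring windows consistent with each other, so the full canvas can exceed the size the underlying model was trained on.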

The researchers compare BeyondScene against existing methods and report that it surpasses them in two respects: correspondence with detailed text descriptions and naturalness of the generated scenes. The generated images show a high degree of realism, diversity, and coherence, with the human elements seamlessly integrated into the overall scene.

Critical Analysis

The researchers acknowledge several limitations and avenues for future work. For example, while BeyondScene excels at generating detailed human-centric scenes, it may struggle with more abstract or complex scene compositions. Additionally, the model's reliance on pre-trained diffusion models means that it inherits any biases or limitations present in those underlying models.

Further research could explore ways to make the model more robust, controllable, and generalizable. Potential directions include developing more sophisticated techniques for handling the human figure, exploring alternative diffusion model architectures, and investigating ways to better incorporate high-level semantic and compositional reasoning.

Overall, BeyondScene represents a significant step forward in text-to-image generation, with a clear focus on the human element. As the field of AI-generated imagery continues to evolve, approaches like BeyondScene that prioritize human-centric and contextually-aware scene generation will likely become increasingly important.

Conclusion

The BeyondScene paper presents a framework that generates higher-resolution, human-centric scenes from text. By building on top of pre-trained diffusion models and adding an instance-aware hierarchical enlargement process, the researchers have demonstrated the ability to create diverse and coherent images that seamlessly integrate human figures into their surrounding environments.

This work represents an important advance in the field of text-to-image generation, pushing the boundaries of what's possible and opening up new avenues for applications in areas like virtual worlds, visual storytelling, and human-AI interaction. While there are still limitations and opportunities for further research, BeyondScene serves as a compelling example of how AI systems can be designed to better capture and represent the human experience within generated imagery.

Related Papers

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Gwanghyun Kim, Hayeon Kim, Hoigi Seo, Dong Un Kang, Se Young Chun

Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods attempted to address training size limit only, they often yielded human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged and hierarchical approach to initially generate a detailed base image focusing on crucial elements in instance creation for multiple humans and detailed descriptions beyond token limit of diffusion model, and then to seamlessly convert the base image to a higher-resolution output, exceeding training image size and incorporating details aware of text and instances via our novel instance-aware hierarchical enlargement process that consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining. Project page: https://janeyeon.github.io/beyond-scene.

4/9/2024

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

Younghyun Kim, Geunmin Hwang, Junyu Zhang, Eunbyung Park

Large-scale generative models, such as text-to-image diffusion models, have garnered widespread attention across diverse domains due to their creative and high-fidelity image generation. Nonetheless, existing large-scale diffusion models are confined to generating images of up to 1K resolution, which is far from meeting the demands of contemporary commercial applications. Directly sampling higher-resolution images often yields results marred by artifacts such as object repetition and distorted shapes. Addressing the aforementioned issues typically necessitates training or fine-tuning models on higher-resolution datasets. However, this poses a formidable challenge due to the difficulty in collecting large-scale high-resolution images and substantial computational resources. While several preceding works have proposed alternatives to bypass the cumbersome training process, they often fail to produce convincing results. In this work, we probe the generative ability of diffusion models at higher resolution beyond their original capability and propose a novel progressive approach that fully utilizes generated low-resolution images to guide the generation of higher-resolution images. Our method obviates the need for additional training or fine-tuning which significantly lowers the burden of computational costs. Extensive experiments and results validate the efficiency and efficacy of our method. Project page: https://yhyun225.github.io/DiffuseHigh/

8/28/2024

High-fidelity Person-centric Subject-to-Image Synthesis

Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin

Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn the semantic scene and person generation by fine-tuning a common pre-trained diffusion, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of the scene and person generation also leads to quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline to eliminate the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM respectively. The subject-scene fusion stage is where the two models collaborate, through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. In each time step, SNF leverages the unique strengths of each model and allows for the spatial blending of predicted noises from both models automatically in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of the Face-diffuser.

5/6/2024

SceneTextGen: Layout-Agnostic Scene Text Image Synthesis with Diffusion Models

Qilong Zhangli, Jindong Jiang, Di Liu, Licheng Yu, Xiaoliang Dai, Ankit Ramchandani, Guan Pang, Dimitris N. Metaxas, Praveen Krishnan

While diffusion models have significantly advanced the quality of image generation, their capability to accurately and coherently render text within these images remains a substantial challenge. Conventional diffusion-based methods for scene text generation are typically limited by their reliance on an intermediate layout output. This dependency often results in a constrained diversity of text styles and fonts, an inherent limitation stemming from the deterministic nature of the layout generation phase. To address these challenges, this paper introduces SceneTextGen, a novel diffusion-based model specifically designed to circumvent the need for a predefined layout stage. By doing so, SceneTextGen facilitates a more natural and varied representation of text. The novelty of SceneTextGen lies in its integration of three key components: a character-level encoder for capturing detailed typographic properties, coupled with a character-level instance segmentation model and a word-level spotting model to address the issues of unwanted text generation and minor character inaccuracies. We validate the performance of our method by demonstrating improved character recognition rates on generated images across different public visual text datasets in comparison to both standard diffusion-based methods and text-specific methods.

7/23/2024