Upsample Guidance: Scale Up Diffusion Models without Training

arXiv:2404.01709

Published 4/3/2024 by Juno Hwang, Yong-Hyun Park, Junghyo Jo

Abstract

Diffusion models have demonstrated superior performance across various generative tasks including images, videos, and audio. However, they encounter difficulties in directly generating high-resolution samples. Previously proposed solutions to this issue involve modifying the architecture, further training, or partitioning the sampling process into multiple stages. These methods cannot use pre-trained models as-is and require additional work. In this paper, we introduce upsample guidance, a technique that adapts a pretrained diffusion model (e.g., $512^2$) to generate higher-resolution images (e.g., $1536^2$) by adding only a single term in the sampling process. Remarkably, this technique does not necessitate any additional training or reliance on external models. We demonstrate that upsample guidance can be applied to various models, such as pixel-space, latent-space, and video diffusion models. We also observed that the proper selection of guidance scale can improve image quality, fidelity, and prompt alignment.

Overview

Diffusion models have shown great performance in generating images, videos, and audio.
However, they struggle to directly produce high-resolution samples.
Previous solutions involved modifying the model architecture, further training, or using multiple sampling stages.
These methods require additional work and can't easily use pre-trained models.

Plain English Explanation

Diffusion models are a type of AI technology that can generate all sorts of content, from images to videos to audio. They've proven to be really good at this task, often outperforming other approaches.

The challenge is that these models have a hard time creating high-resolution, detailed outputs. For example, they might be able to generate a nice 512x512 pixel image, but struggle to make a 1536x1536 version that looks just as good.

Researchers have tried to solve this by changing the model's architecture, training it further, or breaking the generation process into multiple stages. But these solutions come with downsides: they require a lot of additional work, and they can't use an existing pre-trained model as-is.

Technical Explanation

This paper introduces a new technique called "upsample guidance" that can adapt pre-trained diffusion models to generate higher-resolution images without needing to retrain the model or rely on external components.

The key idea is to add a single extra term to the model's sampling process that guides it towards producing higher-resolution outputs. Remarkably, this simple addition allows the model to go from generating 512x512 images to 1536x1536 images, all while using the original pre-trained diffusion model.

The authors demonstrate that upsample guidance can be applied to different types of diffusion models, including those that work in pixel space, latent space, and even video generation. They also find that carefully selecting the "guidance scale" parameter can further improve the quality, fidelity, and alignment of the generated samples.
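To make the idea of "a single extra term" weighted by a guidance scale more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' formulation: the summary does not give the exact form of the term, so the sketch assumes the guidance compares the model's prediction on a downsampled copy of the sample with the downsampled full-resolution prediction. The names `downsample`, `upsample`, `guided_noise`, `toy_eps_model`, and `scale` are all hypothetical, and the sketch ignores any correction a real implementation might need for the change in noise level caused by averaging.

```python
import numpy as np


def downsample(x, m):
    """Average-pool an (H, W) array by an integer factor m (H, W divisible by m)."""
    h, w = x.shape
    return x.reshape(h // m, m, w // m, m).mean(axis=(1, 3))


def upsample(x, m):
    """Nearest-neighbour upsample an (H, W) array by an integer factor m."""
    return np.repeat(np.repeat(x, m, axis=0), m, axis=1)


def guided_noise(eps_model, x_t, t, m, scale):
    """Base noise prediction plus one guidance term (assumed form, see lead-in)."""
    eps_hi = eps_model(x_t, t)                  # prediction at the target resolution
    eps_lo = eps_model(downsample(x_t, m), t)   # prediction at the training resolution
    # Assumption: the guidance is the low-resolution disagreement, upsampled.
    guidance = upsample(eps_lo - downsample(eps_hi, m), m)
    return eps_hi + scale * guidance


def toy_eps_model(x, t):
    """Stand-in for a pre-trained noise predictor; works at any resolution."""
    return np.tanh(x) * t


rng = np.random.default_rng(0)
x_t = rng.standard_normal((6, 6))   # toy "high-resolution" noisy sample
base = toy_eps_model(x_t, t=0.5)
guided = guided_noise(toy_eps_model, x_t, t=0.5, m=3, scale=0.2)
print(guided.shape)  # (6, 6)
```

With `scale=0` the function returns the unmodified prediction, so the pre-trained model is used as-is and the guidance acts purely as an add-on at sampling time. In a real sampler, `eps_model` would be the pre-trained network and `guided_noise` would replace the plain noise prediction at each step.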

Critical Analysis

The paper provides a clever and efficient solution to the challenge of high-resolution generation with diffusion models. By avoiding the need for architectural changes or additional training, the upsample guidance technique makes it much easier to take advantage of pre-trained models.

However, the authors don't explore the limits of this approach. It's unclear how large a resolution increase can be achieved, or whether the guidance term has any downsides for sample quality or consistency.

Additionally, the paper doesn't delve into potential failure modes or edge cases where the upsample guidance might not work as well. Investigating these aspects could help users understand the technique's strengths and limitations more fully.

Conclusion

This research introduces a simple but powerful technique that allows diffusion models to generate high-resolution outputs without changing the underlying architecture or doing any additional training. By adapting pre-trained models through a single guidance term, the upsample guidance approach makes it much more practical to leverage the capabilities of diffusion models for high-fidelity content generation.

While the paper doesn't explore all the potential nuances and edge cases, it represents an important step forward in making diffusion models more accessible and flexible for real-world applications. As the field continues to advance, techniques like this will likely play a key role in unlocking the full potential of these generative models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

Younghyun Kim, Geunmin Hwang, Eunbyung Park

The recent surge in large-scale generative models has spurred the development of vast fields in computer vision. In particular, text-to-image diffusion models have garnered widespread adoption across diverse domains due to their potential for high-fidelity image generation. Nonetheless, existing large-scale diffusion models are confined to generating images of up to 1K resolution, which is far from meeting the demands of contemporary commercial applications. Directly sampling higher-resolution images often yields results marred by artifacts such as object repetition and distorted shapes. Addressing the aforementioned issues typically necessitates training or fine-tuning models on higher-resolution datasets. However, this undertaking poses a formidable challenge due to the difficulty of collecting large-scale high-resolution content and the substantial computational resources required. While several preceding works have proposed alternatives, they often fail to produce convincing results. In this work, we probe the generative ability of diffusion models at resolutions beyond their original capability and propose a novel progressive approach that fully utilizes a generated low-resolution image to guide the generation of a higher-resolution image. Our method obviates the need for additional training or fine-tuning, which significantly lowers the burden of computational costs. Extensive experiments and results validate the efficiency and efficacy of our method.

6/27/2024

cs.CV

Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang

We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full pipeline model trained with internal datasets to produce 1024x1024 images, without cascades, is preferred by 44.0% vs. 21.4% human evaluators over SDXL.

5/28/2024

cs.CV cs.LG

Distribution-Aware Data Expansion with Diffusion Models

Haowei Zhu, Ling Yang, Jun-Hai Yong, Hongzhi Yin, Jiawei Jiang, Meng Xiao, Wentao Zhang, Bin Wang

The scale and quality of a dataset significantly impact the performance of deep models. However, acquiring large-scale annotated datasets is both a costly and time-consuming endeavor. To address this challenge, dataset expansion technologies aim to automatically augment datasets, unlocking the full potential of deep models. Current data expansion techniques include image transformation and image synthesis methods. Transformation-based methods introduce only local variations, leading to limited diversity. In contrast, synthesis-based methods generate entirely new content, greatly enhancing informativeness. However, existing synthesis methods carry the risk of distribution deviations, potentially degrading model performance with out-of-distribution samples. In this paper, we propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model. DistDiff constructs hierarchical prototypes to approximate the real data distribution, optimizing latent data points within diffusion models with hierarchical energy guidance. We demonstrate its capability to generate distribution-consistent samples, significantly improving data expansion tasks. DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data. Furthermore, our approach consistently outperforms existing synthesis-based techniques and demonstrates compatibility with widely adopted transformation-based augmentation methods. Additionally, the expanded dataset exhibits robustness across various architectural frameworks. Our code is available at https://github.com/haoweiz23/DistDiff

6/6/2024

cs.CV

Improved Sample Complexity Bounds for Diffusion Model Training

Shivam Gupta, Aditya Parulekar, Eric Price, Zhiyang Xun

Diffusion models have become the most popular approach to deep generative modeling of images, largely due to their empirical performance and reliability. From a theoretical standpoint, a number of recent works [chen2022, chen2022improved, benton2023linear] have studied the iteration complexity of sampling, assuming access to an accurate diffusion model. In this work, we focus on understanding the sample complexity of training such a model; how many samples are needed to learn an accurate diffusion model using a sufficiently expressive neural network? Prior work [BMR20] showed bounds polynomial in the dimension, desired Total Variation error, and Wasserstein error. We show an exponential improvement in the dependence on Wasserstein error and depth, along with improved dependencies on other relevant parameters.

6/11/2024

cs.LG cs.CV cs.IT stat.ML