Exploiting Diffusion Prior for Real-World Image Super-Resolution

arXiv:2305.07015

Published 7/1/2024 by Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C. K. Chan, Chen Change Loy

Abstract

We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution (SR). Specifically, by employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model, thereby preserving the generative prior and minimizing training cost. To remedy the loss of fidelity caused by the inherent stochasticity of diffusion models, we employ a controllable feature wrapping module that allows users to balance quality and fidelity by simply adjusting a scalar value during the inference process. Moreover, we develop a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models, enabling adaptation to resolutions of any size. A comprehensive evaluation of our method using both synthetic and real-world benchmarks demonstrates its superiority over current state-of-the-art approaches. Code and models are available at https://github.com/IceClear/StableSR.

Overview

Presents a novel approach to leverage pre-trained text-to-image diffusion models for blind super-resolution (SR)
Employs a time-aware encoder to achieve promising restoration results without altering the pre-trained synthesis model
Introduces a controllable feature wrapping module to balance quality and fidelity during inference
Develops a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models

Plain English Explanation

This research proposes a new way to enhance the resolution of low-quality images by tapping into the knowledge captured by pre-trained image generation models. The key idea is to use a specialized "time-aware encoder" that can work with the existing image generation model without needing to modify it. This allows the system to benefit from the powerful generative capabilities of the pre-trained model, while also enabling high-quality image restoration.

To address the inherent unpredictability of diffusion models, the researchers developed a "controllable feature wrapping module" that lets users fine-tune the balance between image quality and fidelity during the restoration process. They also devised a "progressive aggregation sampling strategy" to overcome the fixed-size limitations of the pre-trained models, enabling the system to handle images of any resolution.

The researchers thoroughly evaluated their method using both synthetic and real-world benchmarks, demonstrating that it outperforms the current state-of-the-art approaches in blind super-resolution.

Technical Explanation

The approach leverages the generative priors captured by pre-trained text-to-image diffusion models for the task of blind super-resolution. The researchers employ a time-aware encoder that guides the pre-trained synthesis model without altering it, thereby preserving the generative prior and minimizing training costs. Related work on generative priors for super-resolution includes Boosting Flow-based Generative Super-Resolution Models via Learned Prior and Burst Super-Resolution with Diffusion Models for Improving Perceptual Quality, both listed under Related Papers.
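To make the idea of conditioning a frozen model more concrete, here is a minimal Python sketch. The summary does not describe the encoder's internals, so everything below is an assumption for illustration: the sinusoidal timestep embedding is the form commonly used in diffusion models, while `time_aware_encode`, `modulate`, and the scale-and-shift conditioning are hypothetical stand-ins for learned layers, not the authors' implementation.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a diffusion timestep t into `dim` values (dim even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def time_aware_encode(lr_features, t, dim=4):
    """Hypothetical encoder: maps low-resolution features and a timestep to (scale, shift).

    The arithmetic here is a toy replacement for trainable layers.
    """
    emb = timestep_embedding(t, dim)
    gate = sum(emb) / len(emb)  # timestep-dependent gain (toy)
    mean = sum(lr_features) / len(lr_features)
    scale = gate * mean
    shift = [gate * x for x in lr_features]
    return scale, shift


def modulate(frozen_features, scale, shift):
    """Adjust features produced by the frozen model; the model's weights are untouched."""
    return [(1.0 + scale) * f + s for f, s in zip(frozen_features, shift)]
```

The point of the sketch is that only the encoder would be trained: features from the pre-trained model are adjusted from outside, and the conditioning changes with the timestep `t`.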

To address the inherent stochasticity of diffusion models, which can lead to a loss of fidelity in the restored images, the researchers introduce a controllable feature wrapping module. This module allows users to balance the trade-off between quality and fidelity during the inference process by adjusting a scalar value.
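The scalar trade-off can be illustrated with a short Python sketch. It assumes a residual form in which a correction derived from the low-resolution input's encoder features is added to the decoder features, scaled by a weight `w` in [0, 1]. The function names, the residual form, the range of `w`, and `toy_correction` (a placeholder for a learned module) are all assumptions for illustration.

```python
def toy_correction(encoder_feat, decoder_feat):
    """Placeholder for a learned module: pulls decoder features toward encoder features."""
    return [e - d for e, d in zip(encoder_feat, decoder_feat)]


def feature_wrap(decoder_feat, encoder_feat, w, correction=toy_correction):
    """Return decoder_feat + w * correction(encoder_feat, decoder_feat).

    w = 0 keeps the purely generative features (favoring perceptual quality);
    larger w injects more information from the input (favoring fidelity).
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w is assumed to lie in [0, 1]")
    delta = correction(encoder_feat, decoder_feat)
    return [d + w * c for d, c in zip(decoder_feat, delta)]
```

Because `w` is only read at inference time in this sketch, a user could sweep it over a few values and pick the output that best balances detail against faithfulness to the input.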

Furthermore, the researchers develop a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models. This strategy enables their system to adapt to input images of any resolution. Related directions appear in One-Step Effective Diffusion Network for Real-World Image Super-Resolution and in DiffuseHigh, a training-free progressive high-resolution approach.
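One common way to run a fixed-size model on a larger image is to process overlapping tiles and blend them with smooth weights. The Python sketch below illustrates that general idea under stated assumptions: the Gaussian weighting, the tile and stride parameters, and the names `tile_starts` and `aggregate_tiles` are illustrative choices, and `process` is a hypothetical stand-in for whatever the fixed-size model does to one tile. Whether and how such blending is repeated at each sampling step is not specified in this summary.

```python
import math


def gaussian_weights(size, sigma_frac=0.25):
    """1D Gaussian window that peaks at the tile center."""
    center = (size - 1) / 2.0
    sigma = max(size * sigma_frac, 1e-6)
    return [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(size)]


def tile_starts(length, tile, stride):
    """Start offsets that cover [0, length); the last tile sits flush with the end."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts


def aggregate_tiles(image, tile, stride, process):
    """Apply `process` to overlapping tiles and fuse results with Gaussian weights."""
    if not 0 < stride <= tile:
        raise ValueError("stride must be in (0, tile] so tiles leave no gaps")
    h, w = len(image), len(image[0])
    th, tw = min(tile, h), min(tile, w)
    wy, wx = gaussian_weights(th), gaussian_weights(tw)
    acc = [[0.0] * w for _ in range(h)]   # weighted sum of tile outputs
    norm = [[0.0] * w for _ in range(h)]  # sum of weights per pixel
    for y0 in tile_starts(h, th, stride):
        for x0 in tile_starts(w, tw, stride):
            patch = [row[x0:x0 + tw] for row in image[y0:y0 + th]]
            out = process(patch)
            for dy in range(th):
                for dx in range(tw):
                    wgt = wy[dy] * wx[dx]
                    acc[y0 + dy][x0 + dx] += wgt * out[dy][dx]
                    norm[y0 + dy][x0 + dx] += wgt
    return [[acc[y][x] / norm[y][x] for x in range(w)] for y in range(h)]
```

Weighting each tile most heavily at its center means pixels near tile borders are dominated by neighboring tiles that see them closer to the middle, which tends to hide seams between tiles.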

Critical Analysis

The researchers acknowledge that their method, while demonstrating promising results, is not without limitations. They mention the potential for further improvements in the balance between quality and fidelity, as well as the need to explore more efficient inference strategies to reduce computational costs.

Additionally, the researchers note that their approach currently relies on pre-trained diffusion models, which may limit its applicability to domains where such models are not readily available. Extending the method to work with other generative models or exploring ways to train the diffusion model and the restoration components jointly could be avenues for future research.

It's worth considering the impact of the researchers' assumptions, such as the availability of pre-trained diffusion models and the specific characteristics of the benchmarks used for evaluation. Validating the method's robustness across a wider range of real-world scenarios and data distributions would provide a more comprehensive understanding of its strengths and limitations.

Conclusion

The researchers present a novel approach that leverages the powerful generative priors of pre-trained text-to-image diffusion models to tackle the challenge of blind super-resolution. By employing a time-aware encoder, a controllable feature wrapping module, and a progressive aggregation sampling strategy, they demonstrate promising results that outperform current state-of-the-art methods.

This work highlights the potential of utilizing pre-trained generative models for challenging image restoration tasks, while also addressing the inherent limitations of diffusion models. The proposed techniques could inspire further advancements in the field of super-resolution and potentially find applications in various domains, such as enhancing the quality of low-resolution images or videos.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, Lei Zhang

Owing to their powerful generative priors, pre-trained text-to-image (T2I) diffusion models have become increasingly popular in solving the real-world image super-resolution problem. However, as a consequence of the heavy quality degradation of input low-resolution (LR) images, the destruction of local structures can lead to ambiguous image semantics. As a result, the content of the reproduced high-resolution image may have semantic errors, deteriorating the super-resolution performance. To address this issue, we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution. First, we train a degradation-aware prompt extractor, which can generate accurate soft and hard semantic prompts even under strong degradation. The hard semantic prompts refer to the image tags, aiming to enhance the local perception ability of the T2I model, while the soft semantic prompts compensate for the hard ones to provide additional representation information. These semantic prompts encourage the T2I model to generate detailed and semantically accurate results. Furthermore, during the inference process, we integrate the LR images into the initial sampling noise to mitigate the diffusion model's tendency to generate excessive random details. The experiments show that our method can reproduce more realistic image details and better preserve the semantics. The source code of our method can be found at https://github.com/cswry/SeeSR.

6/5/2024

cs.CV

Boosting Flow-based Generative Super-Resolution Models via Learned Prior

Li-Yuan Tsao, Yi-Chen Lo, Chia-Che Chang, Hao-Wei Chen, Roy Tseng, Chien Feng, Chun-Yi Lee

Flow-based super-resolution (SR) models have demonstrated astonishing capabilities in generating high-quality images. However, these methods encounter several challenges during image generation, such as grid artifacts, exploding inverses, and suboptimal results due to a fixed sampling temperature. To overcome these issues, this work introduces a conditional learned prior to the inference phase of a flow-based SR model. This prior is a latent code predicted by our proposed latent module conditioned on the low-resolution image, which is then transformed by the flow model into an SR image. Our framework is designed to seamlessly integrate with any contemporary flow-based SR model without modifying its architecture or pre-trained weights. We evaluate the effectiveness of our proposed framework through extensive experiments and ablation analyses. The proposed framework successfully addresses all the inherent issues in flow-based SR models and enhances their performance in various SR scenarios. Our code is available at: https://github.com/liyuantsao/BFSR

5/30/2024

cs.CV cs.AI

Burst Super-Resolution with Diffusion Models for Improving Perceptual Quality

Kyotaro Tokoro, Kazutoshi Akita, Norimichi Ukita

While burst LR images are useful for improving the SR image quality compared with a single LR image, prior SR networks accepting the burst LR images are trained in a deterministic manner, which is known to produce a blurry SR image. In addition, it is difficult to perfectly align the burst LR images, making the SR image more blurry. Since such blurry images are perceptually degraded, we aim to reconstruct the sharp high-fidelity boundaries. Such high-fidelity images can be reconstructed by diffusion models. However, prior SR methods using the diffusion model are not properly optimized for the burst SR task. Specifically, the reverse process starting from a random sample is not optimized for image enhancement and restoration methods, including burst SR. In our proposed method, on the other hand, burst LR features are used to reconstruct the initial burst SR image that is fed into an intermediate step in the diffusion model. This reverse process from the intermediate step 1) skips diffusion steps for reconstructing the global structure of the image and 2) focuses on steps for refining detailed textures. Our experimental results demonstrate that our method can improve the scores of the perceptual quality metrics. Code: https://github.com/placerkyo/BSRD

4/9/2024

cs.CV

One-Step Effective Diffusion Network for Real-World Image Super-Resolution

Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, Lei Zhang

The pre-trained text-to-image diffusion models have been increasingly employed to tackle the real-world image super-resolution (Real-ISR) problem due to their powerful generative image priors. Most of the existing methods start from random noise to reconstruct the high-quality (HQ) image under the guidance of the given low-quality (LQ) image. While promising results have been achieved, such Real-ISR methods require multiple diffusion steps to reproduce the HQ image, increasing the computational cost. Meanwhile, the random noise introduces uncertainty in the output, which is unfriendly to image restoration tasks. To address these issues, we propose a one-step effective diffusion network, namely OSEDiff, for the Real-ISR problem. We argue that the LQ image contains rich information to restore its HQ counterpart, and hence the given LQ image can be directly taken as the starting point for diffusion, eliminating the uncertainty introduced by random noise sampling. We finetune the pre-trained diffusion network with trainable layers to adapt it to complex image degradations. To ensure that the one-step diffusion model could yield HQ Real-ISR output, we apply variational score distillation in the latent space to conduct KL-divergence regularization. As a result, our OSEDiff model can efficiently and effectively generate HQ images in just one diffusion step. Our experiments demonstrate that OSEDiff achieves comparable or even better Real-ISR results, in terms of both objective metrics and subjective evaluations, than previous diffusion model based Real-ISR methods that require dozens or hundreds of steps. The source codes will be released at https://github.com/cswry/OSEDiff.

6/17/2024

eess.IV cs.CV