DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

2405.14224

Published 5/24/2024 by Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu

🖼️

Abstract

Diffusion models have achieved great success in image generation, with the backbone evolving from U-Net to Vision Transformers. However, the computational cost of Transformers is quadratic to the number of tokens, leading to significant challenges when dealing with high-resolution images. In this work, we propose Diffusion Mamba (DiM), which combines the efficiency of Mamba, a sequence model based on State Space Models (SSM), with the expressive power of diffusion models for efficient high-resolution image synthesis. To address the challenge that Mamba cannot generalize to 2D signals, we make several architecture designs including multi-directional scans, learnable padding tokens at the end of each row and column, and lightweight local feature enhancement. Our DiM architecture achieves inference-time efficiency for high-resolution images. In addition, to further improve training efficiency for high-resolution image generation with DiM, we investigate ``weak-to-strong'' training strategy that pretrains DiM on low-resolution images ($256times 256$) and then finetune it on high-resolution images ($512 times 512$). We further explore training-free upsampling strategies to enable the model to generate higher-resolution images (e.g., $1024times 1024$ and $1536times 1536$) without further fine-tuning. Experiments demonstrate the effectiveness and efficiency of our DiM.

Create account to get full access

Overview

Diffusion models have achieved great success in image generation, but the computational cost of the popular Transformer models is a significant challenge, especially for high-resolution images.
The paper proposes a new model called Diffusion Mamba (DiM) that combines the efficiency of Mamba, a sequence model based on State Space Models (SSM), with the expressive power of diffusion models.
To address the challenge that Mamba cannot generalize to 2D signals, the paper introduces several architectural designs, including multi-directional scans, learnable padding tokens, and lightweight local feature enhancement.
The paper also explores training strategies, such as "weak-to-strong" training and training-free upsampling, to improve the efficiency of high-resolution image generation with DiM.

Plain English Explanation

Diffusion models are a type of machine learning model that have been very successful at generating high-quality images. However, the most popular diffusion models, which use Transformer architectures, can be computationally expensive, especially when working with large, high-resolution images.

To address this issue, the researchers developed a new model called Diffusion Mamba (DiM) that combines the efficiency of a different type of model called Mamba with the powerful image generation capabilities of diffusion models. Mamba models are based on a mathematical concept called State Space Models, which allow them to be more computationally efficient than Transformers.

Since Mamba was originally designed for 1D sequences, the researchers had to make some changes to the architecture to make it work well for 2D images. This includes things like scanning the image in multiple directions, adding special padding tokens, and using lightweight features to enhance the model's understanding of the image.

The researchers also explored different training strategies to make DiM even more efficient for generating high-resolution images. One approach they tried was to first train the model on low-resolution images and then fine-tune it on high-resolution images. They also experimented with ways to generate even higher resolution images without having to further train the model.

Overall, the goal of this research was to create a diffusion-based image generation model that is more computationally efficient, particularly for large, high-resolution images, which is an important practical consideration for many real-world applications.

Technical Explanation

The paper proposes a new diffusion model architecture called Diffusion Mamba (DiM) that combines the efficiency of Mamba, a sequence model based on State Space Models (SSM), with the expressive power of diffusion models for efficient high-resolution image synthesis.

To address the challenge that Mamba cannot generalize to 2D signals, the paper introduces several architectural designs:

Multi-directional Scans: The model scans the image in multiple directions (e.g., left-to-right, right-to-left, top-to-bottom, bottom-to-top) to capture spatial information.
Learnable Padding Tokens: The model adds learnable padding tokens at the end of each row and column to help the model understand the 2D structure of the image.
Lightweight Local Feature Enhancement: The model uses lightweight local feature enhancement modules to improve the model's understanding of the image's local features.

The paper also explores two training strategies to improve the efficiency of high-resolution image generation with DiM:

Weak-to-Strong Training: The model is first pretrained on low-resolution images (256x256) and then fine-tuned on high-resolution images (512x512).
Training-free Upsampling: The trained DiM model can be used to generate even higher resolution images (e.g., 1024x1024, 1536x1536) without further fine-tuning.

Experiments demonstrate the effectiveness and efficiency of the proposed DiM architecture in generating high-quality, high-resolution images.

Critical Analysis

The paper addresses an important practical challenge in the field of diffusion-based image generation – the computational cost of Transformer-based models when dealing with high-resolution images. The proposed DiM architecture, which combines the efficiency of Mamba with the expressive power of diffusion models, is a novel and promising approach to address this challenge.

One potential limitation of the research is that the architectural changes made to adapt Mamba for 2D images, such as the multi-directional scans and learnable padding tokens, may not be as well-studied or understood as the core Transformer architecture. It would be interesting to see how these design choices compare to alternative approaches for adapting sequence models to 2D signals.

Additionally, while the training strategies explored in the paper, such as weak-to-strong training and training-free upsampling, demonstrate the potential for improved efficiency, it would be valuable to understand the trade-offs and limitations of these techniques. For example, how does the performance of the training-free upsampling approach compare to fine-tuning the model on higher-resolution images?

Finally, the paper does not provide a detailed analysis of the model's performance on diverse image datasets or real-world applications. Evaluating the DiM architecture's generalization capabilities and its suitability for different use cases would be an important next step in assessing the practical impact of this research.

Overall, the proposed DiM architecture and the accompanying training strategies represent a promising direction for improving the efficiency of diffusion-based image generation, particularly for high-resolution images. Further research and validation of the model's capabilities across a wider range of datasets and applications would be valuable to the field.

Conclusion

The paper presents Diffusion Mamba (DiM), a novel architecture that combines the efficiency of Mamba, a sequence model based on State Space Models, with the expressive power of diffusion models for efficient high-resolution image synthesis.

To address the challenge of adapting Mamba to 2D signals, the researchers introduced architectural designs such as multi-directional scans, learnable padding tokens, and lightweight local feature enhancement. The paper also explored training strategies, including weak-to-strong training and training-free upsampling, to further improve the efficiency of high-resolution image generation with DiM.

The proposed DiM architecture represents a promising approach to addressing the computational challenges of Transformer-based diffusion models, particularly for practical applications requiring the generation of high-quality, high-resolution images. While the research has demonstrated the effectiveness and efficiency of DiM, further exploration of its generalization capabilities and real-world performance would be valuable for assessing the broader impact of this work.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

Shentong Mo, Yapeng Tian

In recent developments, the Mamba architecture, known for its selective state space approach, has shown potential in the efficient modeling of long sequences. However, its application in image generation remains underexplored. Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images. To address this challenge, we introduce a novel diffusion architecture, Diffusion Mamba (DiM), which foregoes traditional attention mechanisms in favor of a scalable alternative. By harnessing the inherent efficiency of the Mamba architecture, DiM achieves rapid inference times and reduced computational load, maintaining linear complexity with respect to sequence length. Our architecture not only scales effectively but also outperforms existing diffusion transformers in both image and video generation tasks. The results affirm the scalability and efficiency of DiM, establishing a new benchmark for image and video generation techniques. This work advances the field of generative models and paves the way for further applications of scalable architectures.

5/28/2024

cs.CV cs.AI cs.LG

Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs

Shentong Mo

Recent advancements in sequence modeling have led to the development of the Mamba architecture, noted for its selective state space approach, offering a promising avenue for efficient long sequence handling. However, its application in 3D shape generation, particularly at high resolutions, remains underexplored. Traditional diffusion transformers (DiT) with self-attention mechanisms, despite their potential, face scalability challenges due to the cubic complexity of attention operations as input length increases. This complexity becomes a significant hurdle when dealing with high-resolution voxel sizes. To address this challenge, we introduce a novel diffusion architecture tailored for 3D point clouds generation-Diffusion Mamba (DiM-3D). This architecture forgoes traditional attention mechanisms, instead utilizing the inherent efficiency of the Mamba architecture to maintain linear complexity with respect to sequence length. DiM-3D is characterized by fast inference times and substantially lower computational demands, quantified in reduced Gflops, thereby addressing the key scalability issues of prior models. Our empirical results on the ShapeNet benchmark demonstrate that DiM-3D achieves state-of-the-art performance in generating high-fidelity and diverse 3D shapes. Additionally, DiM-3D shows superior capabilities in tasks like 3D point cloud completion. This not only proves the model's scalability but also underscores its efficiency in generating detailed, high-resolution voxels necessary for advanced 3D shape modeling, particularly excelling in environments requiring high-resolution voxel sizes. Through these findings, we illustrate the exceptional scalability and efficiency of the Diffusion Mamba framework in 3D shape generation, setting a new standard for the field and paving the way for future explorations in high-resolution 3D modeling technologies.

6/10/2024

cs.CV cs.AI cs.LG

Dimba: Transformer-Mamba Diffusion Models

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang

This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements. Specifically, Dimba sequentially stacked blocks alternate between Transformer and Mamba layers, and integrate conditional information through the cross-attention layer, thus capitalizing on the advantages of both architectural paradigms. We investigate several optimization strategies, including quality tuning, resolution adaption, and identify critical configurations necessary for large-scale image generation. The model's flexible design supports scenarios that cater to specific resource constraints and objectives. When scaled appropriately, Dimba offers substantial throughput and a reduced memory footprint relative to conventional pure Transformers-based benchmarks. Extensive experiments indicate that Dimba achieves comparable performance compared with benchmarks in terms of image quality, artistic rendering, and semantic control. We also report several intriguing properties of architecture discovered during evaluation and release checkpoints in experiments. Our findings emphasize the promise of large-scale hybrid Transformer-Mamba architectures in the foundational stage of diffusion models, suggesting a bright future for text-to-image generation.

6/4/2024

cs.CV

Soft Masked Mamba Diffusion Model for CT to MRI Conversion

Zhenbin Wang, Lei Zhang, Lituan Wang, Zhenwei Zhang

Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) are the predominant modalities utilized in the field of medical imaging. Although MRI capture the complexity of anatomical structures with greater detail than CT, it entails a higher financial costs and requires longer image acquisition times. In this study, we aim to train latent diffusion model for CT to MRI conversion, replacing the commonly-used U-Net or Transformer backbone with a State-Space Model (SSM) called Mamba that operates on latent patches. First, we noted critical oversights in the scan scheme of most Mamba-based vision methods, including inadequate attention to the spatial continuity of patch tokens and the lack of consideration for their varying importance to the target task. Secondly, extending from this insight, we introduce Diffusion Mamba (DiffMa), employing soft masked to integrate Cross-Sequence Attention into Mamba and conducting selective scan in a spiral manner. Lastly, extensive experiments demonstrate impressive performance by DiffMa in medical image generation tasks, with notable advantages in input scaling efficiency over existing benchmark models. The code and models are available at https://github.com/wongzbb/DiffMa-Diffusion-Mamba

6/26/2024

cs.CV