HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization

2405.19751

Published 6/3/2024 by Wenxuan Liu, Sai Qian Zhang

Abstract

Diffusion Transformers (DiTs) have recently gained substantial attention in both industrial and academic fields for their superior visual generation capabilities, outperforming traditional diffusion models that use U-Net. However, the enhanced performance of DiTs also comes with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones. To address these challenges, we introduce the Hybrid Floating-point Quantization for DiT (HQ-DiT), an efficient post-training quantization method that utilizes 4-bit floating-point (FP) precision on both weights and activations for DiT inference. Compared to fixed-point quantization (e.g., INT8), FP quantization, complemented by our proposed clipping range selection mechanism, naturally aligns with the data distribution within DiT, resulting in a minimal quantization error. Furthermore, HQ-DiT also implements a universal identity mathematical transform to mitigate the serious quantization error caused by the outliers. The experimental results demonstrate that DiT can achieve extremely low-precision quantization (i.e., 4 bits) with negligible impact on performance. Our approach marks the first instance where both weights and activations in DiTs are quantized to just 4 bits, with only a 0.12 increase in sFID on ImageNet.

Overview

• This paper introduces HQ-DiT, a post-training quantization method for Diffusion Transformers (DiTs) that uses 4-bit floating-point (FP4) precision for both weights and activations, giving memory and computational savings with negligible impact on generation quality.

• The method pairs FP quantization with a clipping range selection mechanism so that the quantized values align with the data distribution inside DiT, and it applies an identity mathematical transform to reduce the quantization error caused by outliers.

• Experiments on ImageNet image generation show that HQ-DiT stays close to full-precision quality (the abstract reports only a 0.12 increase in sFID) while being 2-4x smaller and 1.5-2x faster.

Plain English Explanation

The paper presents HQ-DiT, a method that makes an existing kind of image generation model, the Diffusion Transformer (DiT), smaller and faster while keeping its output quality high. The key innovation is a way of compressing the model after it has been trained, called "hybrid floating-point quantization," which represents both the model's weights and its activations as 4-bit floating-point numbers.

This hybrid approach allows the model to be significantly smaller in size (2-4x smaller) and faster during inference (1.5-2x faster) compared to standard full-precision models, with negligible loss in output quality. The researchers tested HQ-DiT on image generation with ImageNet, where the 4-bit model's sFID score was only 0.12 higher than before quantization.

The efficiency gains of HQ-DiT come from the quantization scheme. A floating-point format places its representable values more densely near zero, which, according to the abstract, aligns naturally with the data distribution inside DiT, so the model's parameters and computations can be compressed with minimal quantization error. This makes the model more practical to deploy on resource-constrained devices like mobile phones or embedded systems.
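
To make the size argument concrete, the following is a rough back-of-the-envelope sketch of weight storage at different bit widths. The parameter count, the group size, and the 16-bit per-group scale are hypothetical values chosen for illustration; none of them are taken from the paper.

```python
def weight_storage_bytes(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate bytes needed to store n_params weights.

    If group_size is given, one scale factor of scale_bits is stored per
    group of weights (an assumed layout, not the paper's storage format).
    """
    total_bits = n_params * bits_per_weight
    if group_size is not None:
        n_groups = -(-n_params // group_size)  # ceiling division
        total_bits += n_groups * scale_bits
    return total_bits / 8


n_params = 675_000_000  # illustrative parameter count (assumption)
fp16_bytes = weight_storage_bytes(n_params, 16)
fp4_bytes = weight_storage_bytes(n_params, 4, group_size=128)
ratio = fp16_bytes / fp4_bytes

print(f"16-bit weights: {fp16_bytes / 1e9:.2f} GB")
print(f"4-bit weights with per-group scales: {fp4_bytes / 1e9:.2f} GB")
print(f"compression ratio: {ratio:.2f}x")
```

Under these assumptions the ratio lands slightly below 4x, because the per-group scales add a small overhead on top of the 4-bit payload.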

Technical Explanation

The key technical innovations in HQ-DiT are:

• FP4 Hybrid Quantization: HQ-DiT quantizes both weights and activations to 4-bit floating-point (FP4) precision. Combined with a clipping range selection mechanism, the FP format aligns with the data distribution within DiT, which keeps quantization error low and yields memory and computational savings compared to full-precision models.
• Diffusion Transformer Architecture: HQ-DiT targets DiTs, diffusion models that use a transformer backbone instead of a U-Net and have been shown to produce high-quality images. The quantization is applied to DiT inference.
• Post-Training Quantization with Outlier Mitigation: The method is applied after training rather than through quantization-aware training. An identity mathematical transform is used to mitigate the quantization error caused by outliers.
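
The abstract describes 4-bit floating-point quantization combined with a clipping range selection mechanism. The sketch below simulates that idea in NumPy under several assumptions: the E2M1 bit split (1 sign, 2 exponent, 1 mantissa bit), the per-tensor scale, and the brute-force search over candidate clipping values are all illustrative choices, not the paper's actual mechanism.

```python
import numpy as np

# Magnitudes representable by a 4-bit float with 1 sign, 2 exponent and
# 1 mantissa bit (E2M1). The bit split is an assumption for illustration.
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantize_fp4(x, clip_max=None):
    """Simulate FP4 quantization: snap x to a scaled E2M1 grid, return floats."""
    x = np.asarray(x, dtype=np.float64)
    if clip_max is None:
        clip_max = float(np.max(np.abs(x)))  # default: no clipping
    if clip_max == 0.0:
        return np.zeros_like(x)
    scale = clip_max / FP4_E2M1_GRID[-1]  # map clip_max to the largest grid value
    mags = np.clip(np.abs(x) / scale, 0.0, FP4_E2M1_GRID[-1])
    # Pick the nearest grid point for every element.
    idx = np.argmin(np.abs(mags[..., None] - FP4_E2M1_GRID), axis=-1)
    return np.sign(x) * FP4_E2M1_GRID[idx] * scale


def search_clip_max(x, num_candidates=20):
    """Hypothetical helper: try several clipping values, keep the lowest-MSE one."""
    x = np.asarray(x, dtype=np.float64)
    max_abs = float(np.max(np.abs(x)))
    best_clip, best_err = max_abs, np.inf
    for frac in np.linspace(0.3, 1.0, num_candidates):
        clip = frac * max_abs
        err = float(np.mean((x - quantize_fp4(x, clip)) ** 2))
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip, best_err


rng = np.random.default_rng(0)
weights = rng.normal(size=4096)  # synthetic stand-in for a weight tensor
clip, err = search_clip_max(weights)
err_no_clip = float(np.mean((weights - quantize_fp4(weights)) ** 2))
print(f"best clip: {clip:.3f}, MSE: {err:.5f}, MSE without clipping: {err_no_clip:.5f}")
```

Because the unclipped range is one of the candidates, the searched clipping value can never do worse than max-abs scaling on the data it was tuned on. A real implementation would also need packed 4-bit storage and kernels that compute directly on the quantized values, which this simulation leaves out.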

Experiments on ImageNet demonstrate that HQ-DiT stays close to the performance of full-precision models (a 0.12 increase in sFID with 4-bit weights and activations) while being 2-4x smaller in size and 1.5-2x faster during inference. This makes the model much more practical for deployment on resource-constrained devices.
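
The abstract also mentions a universal identity mathematical transform that mitigates outliers. One way such a transform can work is to insert an orthogonal matrix H and its transpose between activations and weights, so that (XH)(HᵀW) = XW while the rotated activations have a flatter range. The sketch below uses a normalized Hadamard matrix for H; that specific choice, the tensor shapes, and the synthetic outlier channel are assumptions for illustration.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))   # synthetic activations: 4 tokens, 64 channels
x[:, 3] = 100.0                # one channel carries a large outlier
w = rng.normal(size=(64, 16))  # synthetic weight matrix

h = hadamard(64)
x_rot = x @ h    # rotated activations
w_rot = h.T @ w  # counter-rotated weights

# The layer output is unchanged because h @ h.T is the identity.
print("outputs match:", np.allclose(x @ w, x_rot @ w_rot))
print("max |x| before:", np.abs(x).max(), "after:", np.abs(x_rot).max())
```

The rotation spreads the outlier's energy across all channels, so the largest activation magnitude drops sharply and a 4-bit grid no longer has to stretch to cover a single extreme value.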

Critical Analysis

The paper provides a thorough evaluation of HQ-DiT and highlights its strengths, but there are a few potential limitations and areas for further research:

• The experiments focus on image generation with DiT on ImageNet, so it is unclear how well the FP4 hybrid quantization would generalize to other tasks or model architectures.
• The paper does not discuss the potential trade-offs or limitations of the custom FP4 format compared to standard integer or floating-point representations.
• While the efficiency gains are significant, the authors do not provide a detailed comparison to other recent quantization techniques, such as PTQ4DiT, TerDiT, or MixDQ.
• The paper could also have discussed the potential challenges or limitations of the post-training quantization process, such as sensitivity to the chosen clipping range or other hyperparameters.
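
The trade-off between a 4-bit floating-point grid and a 4-bit integer grid can be probed with a small synthetic experiment. The sketch below compares the mean squared error of both on bell-shaped random data; the E2M1 layout, the max-abs scaling, and the Gaussian sample are assumptions for illustration and do not reproduce the paper's evaluation.

```python
import numpy as np

# E2M1 magnitudes (1 sign, 2 exponent, 1 mantissa bit); an assumed FP4 layout.
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x):
    """Snap x to a max-abs-scaled E2M1 grid (simulated quantization)."""
    scale = np.max(np.abs(x)) / FP4_E2M1_GRID[-1]
    mags = np.abs(x) / scale
    idx = np.argmin(np.abs(mags[..., None] - FP4_E2M1_GRID), axis=-1)
    return np.sign(x) * FP4_E2M1_GRID[idx] * scale


def fake_quant_int4(x):
    """Symmetric uniform quantization to the integer levels -7..7."""
    scale = np.max(np.abs(x)) / 7.0
    return np.clip(np.round(x / scale), -7, 7) * scale


rng = np.random.default_rng(0)
acts = rng.normal(size=10_000)  # synthetic bell-shaped values

mse_fp4 = float(np.mean((acts - fake_quant_fp4(acts)) ** 2))
mse_int4 = float(np.mean((acts - fake_quant_int4(acts)) ** 2))
print(f"FP4 MSE: {mse_fp4:.5f}  INT4 MSE: {mse_int4:.5f}")
```

On this sample the floating-point grid gives the lower error, since its levels are denser near zero where most values lie. For data spread evenly across the range, the uniform integer grid would be expected to do better, which is the kind of trade-off the second point above asks about.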

Overall, HQ-DiT represents an important step forward in efficient diffusion transformer models, but further research and benchmarking against other state-of-the-art quantization techniques would be valuable to fully assess its capabilities and limitations.

Conclusion

The HQ-DiT paper presents an efficient post-training quantization method for diffusion transformers that uses FP4 hybrid quantization to achieve significant memory and computational savings with negligible impact on image generation quality on ImageNet. The key innovations include 4-bit floating-point quantization of both weights and activations, a clipping range selection mechanism, and an identity transform that mitigates outliers. Experimental results demonstrate that HQ-DiT stays close to the quality of full-precision models while being 2-4x smaller and 1.5-2x faster, making it a promising candidate for deployment on resource-constrained devices. Further research to explore the generalizability of the approach and compare it to other state-of-the-art quantization techniques would be valuable.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

PTQ4DiT: Post-training Quantization for Diffusion Transformers

Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, Yan Yan

The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ method for DiTs. We discover two primary quantization challenges inherent in DiTs, notably the presence of salient channels with extreme magnitudes and the temporal variability in distributions of salient activation over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman's $\rho$-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.

5/28/2024

cs.CV

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

Tianchen Zhao, Tongcheng Fang, Enshu Liu, Wan Rui, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang

Diffusion transformers (DiTs) have exhibited remarkable performance in visual generation tasks, such as generating realistic images or videos based on textual instructions. However, larger model sizes and multi-frame processing for video generation lead to increased computational and memory costs, posing challenges for practical deployment on edge devices. Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. When quantizing diffusion transformers, we find that applying existing diffusion quantization methods designed for U-Net faces challenges in preserving quality. After analyzing the major challenges for quantizing diffusion transformers, we design an improved quantization scheme, ViDiT-Q (Video and Image Diffusion Transformer Quantization), to address these issues. Furthermore, we identify that highly sensitive layers and timesteps hinder quantization for lower bit-widths. To tackle this, we improve ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP). We validate the effectiveness of ViDiT-Q across a variety of text-to-image and video models. While baseline quantization methods fail at W8A8 and produce unreadable content at W4A8, ViDiT-Q achieves lossless W8A8 quantization. ViDiT-Q-MP achieves W4A8 with negligible visual quality degradation, resulting in a 2.5x memory optimization and a 1.5x latency speedup.

6/5/2024

cs.CV

TerDiT: Ternary Diffusion Models with Transformers

Xudong Lu, Aojun Zhou, Ziyi Lin, Qi Liu, Yuhui Xu, Renrui Zhang, Yafei Wen, Shuai Ren, Peng Gao, Junchi Yan, Hongsheng Li

Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion models based on transformer architecture (DiTs). Among these diffusion models, diffusion transformers have demonstrated superior image generation capabilities, boasting lower FID scores and higher scalability. However, deploying large-scale DiT models can be expensive due to their extensive parameter numbers. Although existing research has explored efficient deployment techniques for diffusion models such as model quantization, there is still little work concerning DiT-based models. To tackle this research gap, in this paper, we propose TerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion models with transformers. We focus on the ternarization of DiT networks and scale model sizes from 600M to 4.2B. Our work contributes to the exploration of efficient deployment strategies for large-scale DiT models, demonstrating the feasibility of training extremely low-bit diffusion transformer models from scratch while maintaining competitive image generation capacities compared to full-precision models. Code will be available at https://github.com/Lucky-Lance/TerDiT.

5/24/2024

cs.CV cs.LG

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

Tianchen Zhao, Xuefei Ning, Tongcheng Fang, Enshu Liu, Guyue Huang, Zinan Lin, Shengen Yan, Guohao Dai, Yu Wang

Diffusion models have achieved significant visual generation quality. However, their significant computational and memory costs pose challenges for their application on resource-constrained mobile devices or even desktop GPUs. Recent few-step diffusion models reduce the inference time by reducing the denoising steps. However, their memory consumption is still excessive. Post-Training Quantization (PTQ) replaces high bit-width FP representation with low-bit integer values (INT4/8), which is an effective and efficient technique to reduce the memory cost. However, when applied to few-step diffusion models, existing quantization methods face challenges in preserving both the image quality and text alignment. To address this issue, we propose a mixed-precision quantization framework, MixDQ. First, we design a specialized BOS-aware quantization method for highly sensitive text embedding quantization. Then, we conduct metric-decoupled sensitivity analysis to measure the sensitivity of each layer. Finally, we develop an integer-programming-based method to conduct bit-width allocation. While existing quantization methods fall short at W8A8, MixDQ could achieve W8A8 without performance loss, and W4A8 with negligible visual degradation. Compared with FP16, we achieve 3-4x reduction in model size and memory cost, and 1.45x latency speedup.

5/31/2024

cs.CV cs.AI