Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Read original: arXiv:2406.18583 - Published 6/28/2024 by Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma and 12 others

Overview

The paper presents Lumina-Next, an improved version of Lumina-T2X, a family of flow-based diffusion transformers that generate images, video, and other modalities from text instructions.
The key change is the Next-DiT architecture, which replaces the earlier Flag-DiT design and adds 3D RoPE and sandwich normalizations, alongside sampling changes that reduce inference cost.
Lumina-Next targets three weaknesses the authors report in Lumina-T2X: training instability, slow inference, and extrapolation artifacts.

Plain English Explanation

The researchers have developed a new version of the Lumina-T2X model, called Lumina-Next, which can generate content in various modalities like images, video, and audio, starting from text inputs. Lumina-T2X could already turn text into several kinds of media, but it was unstable to train, slow to run, and produced artifacts when asked for resolutions beyond those it was trained on.

To address this, they redesigned the model's core network and named the new design "Next-DiT". They also changed how the model is sampled, so that it needs fewer steps and less computation per step. According to the authors, Lumina-Next produces better images than the original Lumina-T2X, copes better with resolutions it was not trained on, and is more efficient to train and run.

The researchers believe that advances like Lumina-Next are crucial for pushing the boundaries of what's possible with generative AI. Instead of being limited to just generating text, these models can now create a wide range of multimedia content, opening up new possibilities for creative expression, education, entertainment, and beyond. By making the models stronger and faster, the researchers aim to make these powerful capabilities more accessible and useful in real-world applications.

Technical Explanation

The paper presents the "Lumina-Next" model, which builds upon the Lumina-T2X system to make it stronger and faster. Lumina-T2X is a text-to-any-modality model that can generate diverse outputs like images, video, and audio from text inputs.

To improve on Lumina-T2X, the researchers analyzed its Flag-DiT architecture, identified several suboptimal components, and introduced the Next-DiT architecture together with changes to sampling. According to the paper's abstract, the main changes are:

Next-DiT Architecture: Next-DiT replaces Flag-DiT and adds 3D RoPE (rotary position embeddings) and sandwich normalizations to address the suboptimal components found in Flag-DiT.
Frequency- and Time-Aware Scaled RoPE: After comparing context extrapolation methods for text-to-image generation with 3D RoPE, the authors propose a scaled RoPE variant tailored to diffusion transformers for better resolution extrapolation.
Sigmoid Time Discretization Schedule: A sigmoid schedule for the time steps reduces the number of sampling steps needed to solve the flow ODE.
Context Drop: This method merges redundant visual tokens so that each network evaluation is faster.
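
The paper's abstract names 3D RoPE as one ingredient of Next-DiT. The summary does not give the construction, so the following is only a minimal NumPy sketch assuming the common axial form: the channel dimension is split into three equal chunks, and each chunk is rotated by an ordinary rotary embedding driven by one coordinate (time, height, or width). The function names `rope_1d` and `rope_3d`, the `base` value, and the equal three-way split are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x (n, d) by angles pos * freq."""
    d = x.shape[-1]
    assert d % 2 == 0, "rotary embeddings need an even channel count"
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) one frequency per pair
    ang = pos[..., None] * freqs                # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_3d(x, coords, base=10000.0):
    """Apply rotary embeddings per axis.

    x:      (n, d) float array, d divisible by 6 (assumed equal split).
    coords: (n, 3) array of (t, h, w) positions for each token.
    """
    n, d = x.shape
    assert d % 6 == 0, "need three even-sized channel chunks"
    chunk = d // 3
    parts = [
        rope_1d(x[:, i * chunk:(i + 1) * chunk], coords[:, i].astype(float), base)
        for i in range(3)
    ]
    return np.concatenate(parts, axis=-1)

# Rotary scores depend only on the relative offset between two tokens:
# shifting both positions by the same amount leaves q.k unchanged.
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 12))
k = rng.normal(size=(1, 12))
p = np.array([[0, 2, 3]])
r = np.array([[1, 5, 4]])
shift = np.array([[4, 7, 1]])
a = rope_3d(q, p) @ rope_3d(k, r).T
b = rope_3d(q, p + shift) @ rope_3d(k, r + shift).T
print(np.allclose(a, b))
```

Because each axis gets its own channels, the attention score between two tokens depends on their offset along time, height, and width separately, which is what makes this kind of encoding a natural starting point for resolution extrapolation.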

With these changes, Lumina-Next reports better text-to-image quality and efficiency than Lumina-T2X, stronger resolution extrapolation, and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. The authors also instantiate the framework on visual recognition, multi-view, audio, music, and point cloud generation. The work sits alongside other text-conditioned generative models such as GenTron and Taiyi.
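
The generation speed mentioned here depends partly on how the sampler places its time steps: the paper's abstract describes a sigmoid time discretization schedule that reduces the number of steps needed to solve the flow ODE. The exact parameterization is not given in this summary, so the sketch below is an assumption meant only to illustrate the idea of a non-uniform, sigmoid-shaped time grid driving a plain Euler solver. The names `sigmoid_time_grid`, `euler_sample`, and `toy_velocity`, the `sharpness` parameter, and the convention that t = 0 is noise and t = 1 is data are all hypothetical choices for this example.

```python
import numpy as np

def sigmoid_time_grid(num_steps, sharpness=3.0):
    """Return num_steps + 1 increasing times from 0.0 to 1.0.

    ASSUMPTION: a uniform grid on [-sharpness, sharpness] is pushed through
    a sigmoid and rescaled to [0, 1]. This gives small steps near both ends
    and larger steps in the middle; the paper's schedule may differ.
    """
    s = np.linspace(-sharpness, sharpness, num_steps + 1)
    t = 1.0 / (1.0 + np.exp(-s))
    return (t - t[0]) / (t[-1] - t[0])

def euler_sample(velocity, x0, times):
    """Integrate dx/dt = velocity(x, t) with Euler steps on the given grid."""
    x = np.asarray(x0, dtype=float)
    for t_cur, t_next in zip(times[:-1], times[1:]):
        x = x + (t_next - t_cur) * velocity(x, t_cur)
    return x

# Toy stand-in for a learned network: a straight-line flow toward a fixed
# target. A real model would predict the velocity from x, t, and the text.
target = np.array([1.0, -2.0])

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

times = sigmoid_time_grid(8)
sample = euler_sample(toy_velocity, np.zeros(2), times)
print(np.diff(times))   # step sizes: smallest at the ends of the grid
print(sample)           # lands on the target for this straight-line flow
```

For this toy field the path is a straight line, so Euler reaches the target with any grid. With a learned velocity field the path is curved, and where the small steps are placed determines how few steps the sampler can use before quality degrades.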

Critical Analysis

The researchers make a compelling case for the advantages of Lumina-Next over the original Lumina-T2X model. Redesigning the architecture as Next-DiT, rather than only scaling up Flag-DiT, appears to be a promising way to improve text-to-any-modality systems.

However, the paper does not provide a thorough analysis of the potential limitations or drawbacks of the Lumina-Next approach. For example, it would be helpful to understand the computational and memory requirements of the model, as well as any potential biases or fairness issues that may arise from the training data or model architecture.

Additionally, the researchers could have explored the potential societal implications of such advanced generative AI systems, both positive and negative. As these models become more powerful and accessible, it will be crucial to consider their impact on fields like education, entertainment, and even misinformation.

Overall, the paper presents an exciting advancement in generative AI, but a more comprehensive discussion of the caveats and areas for further research would strengthen the analysis and better prepare readers to think critically about the technology.

Conclusion

The Lumina-Next model represents a significant step forward in the field of text-to-any-modality generation. By replacing Flag-DiT with the Next-DiT architecture and speeding up sampling with a sigmoid time schedule and Context Drop, the researchers improve both the generation quality and the efficiency of the Lumina-T2X system.

This work joins related efforts such as GenTron and Taiyi in extending generative AI beyond LLMs. As these technologies continue to evolve, it will be crucial to carefully consider their societal implications and ensure they are developed and deployed responsibly.

Related Papers

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao

Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduced a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all codes and model weights, we aim to advance the development of next-generation generative AI capable of universal modeling.

6/28/2024

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.

6/14/2024

TerDiT: Ternary Diffusion Models with Transformers

Xudong Lu, Aojun Zhou, Ziyi Lin, Qi Liu, Yuhui Xu, Renrui Zhang, Yafei Wen, Shuai Ren, Peng Gao, Junchi Yan, Hongsheng Li

Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion models based on transformer architecture (DiTs). Among these diffusion models, diffusion transformers have demonstrated superior image generation capabilities, boosting lower FID scores and higher scalability. However, deploying large-scale DiT models can be expensive due to their extensive parameter numbers. Although existing research has explored efficient deployment techniques for diffusion models such as model quantization, there is still little work concerning DiT-based models. To tackle this research gap, in this paper, we propose TerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion models with transformers. We focus on the ternarization of DiT networks and scale model sizes from 600M to 4.2B. Our work contributes to the exploration of efficient deployment strategies for large-scale DiT models, demonstrating the feasibility of training extremely low-bit diffusion transformer models from scratch while maintaining competitive image generation capacities compared to full-precision models. Code will be available at https://github.com/Lucky-Lance/TerDiT.

5/24/2024

GenTron: Diffusion Transformers for Image and Video Generation

Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua

In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters, observing significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.

6/4/2024