Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

Read original: arXiv:2405.05945 - Published 6/14/2024 by Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng and 10 others

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

Overview

Lumina-T2X is a novel method for transforming text into any modality, resolution, and duration using flow-based large diffusion transformers.
It can generate high-quality images, videos, and other media from text inputs.
The paper introduces a new architecture called Flow-based Large Diffusion Transformers (Flag-DiT) that enables these capabilities.

Plain English Explanation

Lumina-T2X is an exciting new system that can take text as input and convert it into all kinds of media - images, videos, audio, and more. The key innovation is a new type of machine learning model called a "Flow-based Large Diffusion Transformer" that can effectively learn to generate these different types of content from just text alone.

This means you could, for example, describe a scene in words and have Lumina-T2X automatically generate a high-quality image or video of that scene for you. Or you could write a script and have it turned into an animated short film. The possibilities are really quite broad.

The power of Lumina-T2X comes from its ability to understand the semantic meaning and visual/auditory concepts conveyed by the text, and then use that understanding to create coherent and compelling media outputs. It's not just mechanically translating words into pixels or audio, but actually comprehending the underlying ideas and then generating appropriate visuals or audio to bring them to life.

This kind of technology could have a huge impact, making it much easier for people to create rich, immersive content without needing specialized artistic or technical skills. It could revolutionize fields like filmmaking, game development, advertising, education, and many others.

Technical Explanation

At the heart of Lumina-T2X is a new neural network architecture called Flow-based Large Diffusion Transformers (Flag-DiT). This combines the power of large language models like GPT with the image/video generation capabilities of diffusion models.

The key innovation is the "flow-based" aspect, which allows the model to efficiently transform the text input into the desired output modality (e.g. image, video, audio) by learning a series of learned, reversible "flows" between the text and media representations. This enables high-fidelity generation without the need for complex conditioning or iterative refinement.

The model is trained on large datasets spanning text, images, videos, and other media, allowing it to learn the rich connections between language and different sensory modalities. During inference, the text input is passed through the flow-based transformations to generate the requested output, which can be at any desired resolution or duration.

Experiments show that Lumina-T2X is able to generate high-quality, semantically coherent outputs across a wide range of modalities, outperforming prior text-to-media approaches. The authors also demonstrate the model's ability to handle multimodal inputs, generate diverse outputs, and maintain consistent styles and coherence.

Critical Analysis

One key limitation of Lumina-T2X mentioned in the paper is the computational cost and memory requirements of the large-scale diffusion models used. This could make the system challenging to deploy at scale or on resource-constrained devices.

The authors also note that the model may struggle with highly abstract or open-ended text prompts, and that the generated outputs can sometimes exhibit bias or inaccuracies reflecting the biases present in the training data. Further research is needed to address these issues and improve the model's robustness.

Additionally, there are important ethical considerations around the use of such powerful generative AI systems, such as the potential for misuse, the need for transparent attribution, and the societal impact of artificially generated media. The paper does not delve deeply into these concerns, which will be crucial for ensuring Lumina-T2X is developed and deployed responsibly.

Overall, Lumina-T2X represents a significant advance in text-to-media generation, demonstrating the vast potential of large diffusion models. However, further technical refinements and careful consideration of the social implications will be necessary to unlock the full benefits of this transformative technology.

Conclusion

Lumina-T2X is a groundbreaking system that can convert text into a wide variety of media outputs, from images and videos to audio and beyond. Its novel Flow-based Large Diffusion Transformer architecture enables high-fidelity generation across modalities, resolution, and duration.

This technology has the potential to revolutionize content creation, making it easier for people to bring their ideas to life in rich, immersive ways. However, it also raises important questions around the ethical use of such powerful generative AI systems. Careful development and deployment will be crucial to ensure Lumina-T2X is used responsibly and for the benefit of society.

Overall, the research presented in this paper represents a significant step forward in the field of multimodal AI, and its impact is likely to be felt across many industries and applications in the years to come.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.

6/14/2024

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao

Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduced a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all codes and model weights, we aim to advance the development of next-generation generative AI capable of universal modeling.

6/28/2024

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.

9/4/2024

Generative AI Beyond LLMs: System Implications of Multi-Modal Generation

Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu

As the development of large-scale Generative AI models evolve beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models. Current model architecture designs are bifurcated into 2 categories: Diffusion- and Transformer-based models. Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models, while Linear layers consume up to 49% of execution time for Transformer-based models. We additionally observe that Diffusion-based TTI models resemble the Prefill stage of LLM inference, and benefit from 1.1-2.5x greater speedup from Flash Attention than Transformer-based TTI models that resemble the Decode phase. Since optimizations designed for LLMs do not map directly onto TTI/TTV models, we must conduct a thorough characterization of these workloads to gain insights for new optimization opportunities. In doing so, we define sequence length in the context of TTI/TTV models and observe sequence length can vary up to 4x in Diffusion model inference. We additionally observe temporal aspects of TTV workloads pose unique system bottlenecks, with Temporal Attention accounting for over 60% of total Attention time. Overall, our in-depth system performance characterization is a critical first step towards designing efficient and deployable systems for emerging TTI/TTV workloads.

5/7/2024