Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

2405.14832

Published 6/4/2024 by Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, Yao Yao

🛸

Abstract

Generating high-quality 3D assets from text and images has long been challenging, primarily due to the absence of scalable 3D representations capable of capturing intricate geometry distributions. In this work, we introduce Direct3D, a native 3D generative model scalable to in-the-wild input images, without requiring a multiview diffusion model or SDS optimization. Our approach comprises two primary components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently encodes high-resolution 3D shapes into a compact and continuous latent triplane space. Notably, our method directly supervises the decoded geometry using a semi-continuous surface sampling strategy, diverging from previous methods relying on rendered images as supervision signals. D3D-DiT models the distribution of encoded 3D latents and is specifically designed to fuse positional information from the three feature maps of the triplane latent, enabling a native 3D generative model scalable to large-scale 3D datasets. Additionally, we introduce an innovative image-to-3D generation pipeline incorporating semantic and pixel-level image conditions, allowing the model to produce 3D shapes consistent with the provided conditional image input. Extensive experiments demonstrate the superiority of our large-scale pre-trained Direct3D over previous image-to-3D approaches, achieving significantly better generation quality and generalization ability, thus establishing a new state-of-the-art for 3D content creation. Project page: https://nju-3dv.github.io/projects/Direct3D/.

Create account to get full access

Overview

Generating high-quality 3D assets from text and images has long been a challenging task
The authors introduce Direct3D, a new approach that can generate 3D shapes directly from single-view images without the need for complex optimization or multi-view diffusion models
Direct3D comprises two key components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT)
The method can produce 3D shapes consistent with provided image conditions, outperforming previous state-of-the-art text-to-3D and image-to-3D generation approaches

Plain English Explanation

Creating high-quality 3D models from text descriptions or images has historically been a challenging problem. The authors of this paper have developed a new technique called Direct3D that can generate 3D shapes directly from single-view images, without requiring complex optimization steps or the use of multiple camera views.

Direct3D has two main components. The first is a Direct 3D Variational Auto-Encoder (D3D-VAE), which efficiently encodes high-resolution 3D shapes into a compact, continuous latent space. Importantly, this encoding process directly supervises the decoded geometry, rather than relying on rendered images as the training signal.

The second component is the Direct 3D Diffusion Transformer (D3D-DiT), which models the distribution of the encoded 3D latents. This transformer is designed to effectively fuse the positional information from the three feature maps of the latent triplane representation, enabling a "native" 3D generative model that can scale to large-scale 3D datasets.

Additionally, the authors introduce an image-to-3D generation pipeline that incorporates both semantic and pixel-level image conditions. This allows the model to produce 3D shapes that are consistent with the provided input image.

Through extensive experiments, the researchers demonstrate that their large-scale pre-trained Direct3D model outperforms previous state-of-the-art approaches for text-to-3D and image-to-3D generation, in terms of both generation quality and generalization ability. This represents a significant advancement in the field of 3D content creation.

Technical Explanation

The authors introduce a new approach called Direct3D, which consists of two primary components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT).

The D3D-VAE efficiently encodes high-resolution 3D shapes into a compact, continuous latent triplane space. Unlike previous methods that rely on rendered images as supervision signals, the authors' approach directly supervises the decoded geometry using a semi-continuous surface sampling strategy.

The D3D-DiT models the distribution of the encoded 3D latents and is designed to effectively fuse the positional information from the three feature maps of the triplane latent, enabling a "native" 3D generative model that can scale to large-scale 3D datasets. This approach contrasts with previous methods that require multi-view diffusion models or SDS optimization, such as PI3D and DiffTF-3D.

Additionally, the authors introduce an innovative image-to-3D generation pipeline that incorporates both semantic and pixel-level image conditions. This allows the model to produce 3D shapes that are consistent with the provided input image, unlike approaches like VolumeDiffusion and DiffusionGAN3D that may struggle with direct image-to-3D generation.

Critical Analysis

The authors acknowledge several caveats and limitations of their work. For example, they note that while Direct3D outperforms previous approaches, there is still room for improvement in terms of the generated 3D shapes' fidelity and consistency with the input conditions.

Additionally, the authors highlight the need for further research to address challenges such as better handling of object occlusions, supporting more diverse 3D object categories, and improving the efficiency of the 3D generation process.

One potential concern that could be raised is the reliance on a triplane latent representation, which may not be able to capture all the complexities of real-world 3D shapes. The authors could explore alternative latent representations or hierarchical approaches to address this limitation.

Furthermore, the authors do not provide a detailed analysis of the computational and memory requirements of their model, which would be valuable information for practitioners considering the practical deployment of Direct3D.

Overall, the authors have made a significant contribution to the field of 3D content creation, but there remain opportunities for further research and refinement of the approach.

Conclusion

The authors have developed a novel 3D generation model called Direct3D, which can efficiently generate high-quality 3D shapes directly from single-view input images. This represents a significant advancement over previous state-of-the-art approaches that require complex optimization or multi-view diffusion models.

The key innovations of Direct3D include the Direct 3D Variational Auto-Encoder (D3D-VAE) for compact 3D shape encoding, and the Direct 3D Diffusion Transformer (D3D-DiT) for scalable 3D latent modeling. The researchers have also introduced an effective image-to-3D generation pipeline that can produce 3D shapes consistent with the provided input conditions.

The authors' extensive experiments demonstrate the superiority of Direct3D over previous text-to-3D and image-to-3D generation methods, establishing a new state-of-the-art for 3D content creation. This work has important implications for a wide range of applications, from virtual reality and gaming to product design and e-commerce.

While the authors have made significant progress, they acknowledge the need for further research to address remaining challenges and limitations. Overall, the Direct3D model represents an important step forward in the quest to enable more efficient and scalable 3D content generation from text and images.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

Qihao Liu, Yi Zhang, Song Bai, Adam Kortylewski, Alan Yuille

We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data, limiting them to single or few-class generation, our model is directly trained on extensive noisy and unaligned `in-the-wild' 3D assets, mitigating the key challenge (i.e., data scarcity) in large-scale 3D generation. In particular, DIRECT-3D is a tri-plane diffusion model that integrates two innovations: 1) A novel learning framework where noisy data are filtered and aligned automatically during the training process. Specifically, after an initial warm-up phase using a small set of clean data, an iterative optimization is introduced in the diffusion process to explicitly estimate the 3D pose of objects and select beneficial data based on conditional density. 2) An efficient 3D representation that is achieved by disentangling object geometry and color features with two separate conditional diffusion models that are optimized hierarchically. Given a prompt input, our model generates high-quality, high-resolution, realistic, and complex 3D objects with accurate geometric details in seconds. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation. We also demonstrate that DIRECT-3D can serve as a useful 3D geometric prior of objects, for example to alleviate the well-known Janus problem in 2D-lifting methods such as DreamFusion. The code and models are available for research purposes at: https://github.com/qihao067/direct3d.

6/10/2024

cs.CV

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

Xinyang Li, Zhangyu Lai, Linning Xu, Jianfei Guo, Liujuan Cao, Shengchuan Zhang, Bo Dai, Rongrong Ji

We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from texts in only $1$ minute.The key component is a dual-mode multi-view latent diffusion model. Given the noisy multi-view latents, the 2D mode can efficiently denoise them with a single latent denoising network, while the 3D mode can generate a tri-plane neural surface for consistent rendering-based denoising. Most modules for both modes are tuned from a pre-trained text-to-image latent diffusion model to circumvent the expensive cost of training from scratch. To overcome the high rendering cost during inference, we propose the dual-mode toggling inference strategy to use only $1/10$ denoising steps with 3D mode, successfully generating a 3D asset in just $10$ seconds without sacrificing quality. The texture of the 3D asset can be further enhanced by our efficient texture refinement process in a short time. Extensive experiments demonstrate that our method delivers state-of-the-art performance while significantly reducing generation time. Our project page is available at https://dual3d.github.io

5/17/2024

cs.CV

PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion

Ying-Tian Liu, Yuan-Chen Guo, Guan Luo, Heyi Sun, Wei Yin, Song-Hai Zhang

Diffusion models trained on large-scale text-image datasets have demonstrated a strong capability of controllable high-quality image generation from arbitrary text prompts. However, the generation quality and generalization ability of 3D diffusion models is hindered by the scarcity of high-quality and large-scale 3D datasets. In this paper, we present PI3D, a framework that fully leverages the pre-trained text-to-image diffusion models' ability to generate high-quality 3D shapes from text prompts in minutes. The core idea is to connect the 2D and 3D domains by representing a 3D shape as a set of Pseudo RGB Images. We fine-tune an existing text-to-image diffusion model to produce such pseudo-images using a small number of text-3D pairs. Surprisingly, we find that it can already generate meaningful and consistent 3D shapes given complex text descriptions. We further take the generated shapes as the starting point for a lightweight iterative refinement using score distillation sampling to achieve high-quality generation under a low budget. PI3D generates a single 3D shape from text in only 3 minutes and the quality is validated to outperform existing 3D generative models by a large margin.

4/23/2024

cs.CV

DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation

Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu

Generating diverse and high-quality 3D assets automatically poses a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges with a single model. To handle the large diversity and complexity in geometry and texture across categories efficiently, we 1) adopt improved triplane to guarantee efficiency; 2) introduce the 3D-aware transformer to aggregate the generalized 3D knowledge with specialized 3D features; and 3) devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware Diffusion model with TransFormer, DiffTF, we propose a stronger version for 3D generation, i.e., DiffTF++. It boils down to two parts: multi-view reconstruction loss and triplane refinement. Specifically, we utilize multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence caused by reconstruction errors and improving texture synthesis. By eliminating the mismatch between the two stages, the generative performance is enhanced, especially in texture. Additionally, a 3D-aware refinement process is introduced to filter out artifacts and refine triplanes, resulting in the generation of more intricate and reasonable details. Extensive experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules and the state-of-the-art 3D object generation performance with large diversity, rich semantics, and high quality.

5/15/2024

cs.CV