VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

2406.14964

Published 6/24/2024 by Zixuan Chen, Ruijie Su, Jiahao Zhu, Lingxiao Yang, Jian-Huang Lai, Xiaohua Xie

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

Abstract

Text-to-3D generation aims to create 3D assets from text-to-image diffusion models. However, existing methods face an inherent bottleneck in generation quality because the widely-used objectives such as Score Distillation Sampling (SDS) inappropriately omit U-Net jacobians for swift generation, leading to significant bias compared to the true gradient obtained by full denoising sampling. This bias brings inconsistent updating direction, resulting in implausible 3D generation e.g., color deviation, Janus problem, and semantically inconsistent details). In this work, we propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel yet efficient objective for diffusion-based 3D generation tasks. Specifically, PCDS builds the pose-dependent consistency function within diffusion trajectories, allowing to approximate true gradients through minimal sampling steps (1-3). Compared to SDS, PCDS can acquire a more accurate updating direction with the same sampling time (1 sampling step), while enabling few-step (2-3) sampling to trade compute for higher generation quality. For efficient generation, we propose a coarse-to-fine optimization strategy, which first utilizes 1-step PCDS to create the basic structure of 3D objects, and then gradually increases PCDS steps to generate fine-grained details. Extensive experiments demonstrate that our approach outperforms the state-of-the-art in generation quality and training efficiency, conspicuously alleviating the implausible 3D generation issues caused by the deviated updating direction. Moreover, it can be simply applied to many 3D generative applications to yield impressive 3D assets, please see our project page: https://narcissusex.github.io/VividDreamer.

Create account to get full access

Overview

Presents a new text-to-3D generation model called VividDreamer that aims for high-fidelity and efficient 3D generation
Uses diffusion models and consistency models to generate 3D shapes from text descriptions
Introduces a novel 3D Gaussian splatting technique to improve the quality and efficiency of the generated 3D shapes

Plain English Explanation

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation is a research paper that describes a new AI model for generating 3D shapes from text descriptions. The key ideas are:

The model uses "diffusion models" and "consistency models" to convert text into 3D shapes. Diffusion models start with a random 3D shape and gradually refine it to match the text description, while consistency models ensure the generated shape is coherent and makes sense.
The paper introduces a novel technique called "3D Gaussian splatting" that improves the quality and efficiency of the generated 3D shapes. This involves representing the 3D shape as a collection of 3D "blobs" that can be rendered efficiently.
The goal is to create a system that can generate high-quality 3D shapes from text in a fast and efficient way, which could be useful for applications like video games, animations, and virtual reality.

Technical Explanation

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation presents a new text-to-3D generation model that combines diffusion models and consistency models to generate high-fidelity 3D shapes efficiently.

The diffusion model starts with a random 3D shape and gradually refines it to match the given text description, while the consistency model ensures the generated shape is coherent and semantically meaningful. The paper introduces a novel 3D Gaussian splatting technique to represent the 3D shape, which allows for efficient rendering and generation.

Experiments show that VividDreamer outperforms previous state-of-the-art text-to-3D generation models in terms of fidelity, efficiency, and semantic consistency. The model is able to generate diverse and plausible 3D shapes from a wide range of text descriptions.

Critical Analysis

The paper presents a promising approach to text-to-3D generation, but there are a few potential limitations and areas for further research:

The paper does not provide a detailed analysis of the computational complexity and runtime performance of the model, which would be important for real-world applications.
The evaluation is limited to subjective human judgments, and more rigorous quantitative metrics could be used to further assess the quality of the generated 3D shapes.
The model is trained on a relatively small dataset of text-3D pairs, and its performance on more diverse and challenging text descriptions is not explored.
Potential biases or limitations in the training data and model architecture are not discussed, which could lead to undesirable outputs or fairness issues.

PI3D: Efficient Text-to-3D Generation with Pseudo-Integrators and Grounded: Compositional, Diverse Text-to-3D Pretrained are related works that explore alternative approaches to text-to-3D generation that could be compared to VividDreamer.

Conclusion

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation presents a novel text-to-3D generation model that combines diffusion models and consistency models to produce high-quality 3D shapes efficiently. The key innovation is the use of 3D Gaussian splatting to represent the 3D shapes, which enables fast rendering and generation.

The results show that VividDreamer outperforms previous state-of-the-art models in terms of fidelity, efficiency, and semantic consistency. This could have important applications in areas like virtual reality, video games, and 3D design. However, further research is needed to address potential limitations and ensure the model's robustness and fairness.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior

Zike Wu, Pan Zhou, Xuanyu Yi, Xiaoding Yuan, Hanwang Zhang

Score distillation sampling (SDS) and its variants have greatly boosted the development of text-to-3D generation, but are vulnerable to geometry collapse and poor textures yet. To solve this issue, we first deeply analyze the SDS and find that its distillation sampling process indeed corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample which then serves as a guidance to optimize a 3D model. However, the randomness in SDE sampling often leads to a diverse and unpredictable sample which is not always less noisy, and thus is not a consistently correct guidance, explaining the vulnerability of SDS. Since for any SDE, there always exists an ordinary differential equation (ODE) whose trajectory sampling can deterministically and consistently converge to the desired target point as the SDE, we propose a novel and effective Consistent3D method that explores the ODE deterministic sampling prior for text-to-3D generation. Specifically, at each training iteration, given a rendered image by a 3D model, we first estimate its desired 3D score function by a pre-trained 2D diffusion model, and build an ODE for trajectory sampling. Next, we design a consistency distillation sampling loss which samples along the ODE trajectory to generate two adjacent samples and uses the less noisy sample to guide another more noisy one for distilling the deterministic prior into the 3D model. Experimental results show the efficacy of our Consistent3D in generating high-fidelity and diverse 3D objects and large-scale scenes, as shown in Fig. 1. The codes are available at https://github.com/sail-sg/Consistent3D.

6/14/2024

cs.CV cs.LG

PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion

Ying-Tian Liu, Yuan-Chen Guo, Guan Luo, Heyi Sun, Wei Yin, Song-Hai Zhang

Diffusion models trained on large-scale text-image datasets have demonstrated a strong capability of controllable high-quality image generation from arbitrary text prompts. However, the generation quality and generalization ability of 3D diffusion models is hindered by the scarcity of high-quality and large-scale 3D datasets. In this paper, we present PI3D, a framework that fully leverages the pre-trained text-to-image diffusion models' ability to generate high-quality 3D shapes from text prompts in minutes. The core idea is to connect the 2D and 3D domains by representing a 3D shape as a set of Pseudo RGB Images. We fine-tune an existing text-to-image diffusion model to produce such pseudo-images using a small number of text-3D pairs. Surprisingly, we find that it can already generate meaningful and consistent 3D shapes given complex text descriptions. We further take the generated shapes as the starting point for a lightweight iterative refinement using score distillation sampling to achieve high-quality generation under a low budget. PI3D generates a single 3D shape from text in only 3 minutes and the quality is validated to outperform existing 3D generative models by a large margin.

4/23/2024

cs.CV

📈

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto

In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts, and may often entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the necessity to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D from the same text prompt.

4/30/2024

cs.CV cs.AI

DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation

Yukun Huang, Jianan Wang, Yukai Shi, Boshi Tang, Xianbiao Qi, Lei Zhang

Text-to-image diffusion models pre-trained on billions of image-text pairs have recently enabled 3D content creation by optimizing a randomly initialized differentiable 3D representation with score distillation. However, the optimization process suffers slow convergence and the resultant 3D models often exhibit two limitations: (a) quality concerns such as missing attributes and distorted shape and texture; (b) extremely low diversity comparing to text-guided image synthesis. In this paper, we show that the conflict between the 3D optimization process and uniform timestep sampling in score distillation is the main reason for these limitations. To resolve this conflict, we propose to prioritize timestep sampling with monotonically non-increasing functions, which aligns the 3D optimization process with the sampling process of diffusion model. Extensive experiments show that our simple redesign significantly improves 3D content creation with faster convergence, better quality and diversity.

5/7/2024

cs.CV cs.GR cs.LG