BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion

Read original: arXiv:2401.16764 - Published 9/18/2024 by Yonghao Yu, Shunan Zhu, Huai Qin, Haorui Li


Overview

Text-to-3D generation is an active area of research with two main approaches:
- Feed-forward generation models can quickly produce 3D assets, but the results are often coarse.
- Score Distillation Sampling (SDS) models generate high-quality 3D assets, but at a slower pace.

Plain English Explanation

The paper presents a new method called BoostDream that aims to efficiently refine coarse 3D assets into high-quality ones. BoostDream has three key components:

- 3D Model Distillation: BoostDream fits differentiable representations to the 3D assets generated by feed-forward models.
- Novel Multi-View SDS Loss: BoostDream uses a multi-view aware 2D diffusion model to refine the 3D assets.
- Prompt and Multi-View Normal Map Guidance: BoostDream uses the prompt and multi-view consistent normal maps to guide the refinement process.

The authors claim that BoostDream can generate high-quality 3D assets rapidly while avoiding the Janus problem (the same face or feature repeated on several sides of an object) that affects conventional SDS-based methods.

Technical Explanation

The paper introduces BoostDream, a highly efficient 3D refinement method that combines the strengths of feed-forward generation and SDS-based techniques.

- 3D Model Distillation: The authors fit differentiable representations, such as signed distance fields or neural radiance fields, to the 3D assets generated by feed-forward models. This allows BoostDream to work with a continuous 3D representation that can be further refined.
- Novel Multi-View SDS Loss: BoostDream uses a 2D diffusion model that is aware of multiple views of the 3D asset. This multi-view SDS loss enables the refinement process to consider the 3D structure from various perspectives, leading to higher-quality results.
- Prompt and Multi-View Normal Map Guidance: The authors propose using the original text prompt and multi-view consistent normal maps as additional guidance during the refinement process. This helps BoostDream maintain the semantic and geometric fidelity of the final 3D asset.
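
The first two components can be illustrated with a toy sketch. This is not the paper's implementation: the voxel grid, the axis-averaging renderer `render_views`, the closed-form `predict_noise` (a hypothetical stand-in for a multi-view diffusion model), the weighting `1 - alpha_bar`, and all step sizes are assumptions chosen so the example runs with NumPy alone. Prompt conditioning and normal map guidance are omitted.

```python
import numpy as np

N = 8  # resolution of the toy voxel grid (assumed)


def render_views(voxels):
    """Toy differentiable renderer: average an (N, N, N) grid along each axis.

    Returns a (3, N, N) stack of views; a stand-in for a real renderer.
    """
    return np.stack([voxels.mean(axis=a) for a in range(3)])


def render_views_vjp(grad_views, n):
    """Push per-view image gradients back to the grid (transpose of render_views)."""
    grad = np.zeros((n, n, n))
    for a in range(3):
        grad += np.expand_dims(grad_views[a], axis=a) / n
    return grad


def distill(coarse_views, steps=100, lr=2.0):
    """Stage 1: fit the grid to renders of the coarse asset.

    Gradient descent on 0.5 * sum of squared view residuals.
    """
    voxels = np.zeros((N, N, N))
    for _ in range(steps):
        residual = render_views(voxels) - coarse_views
        voxels -= lr * render_views_vjp(residual, N)
    return voxels


def multiview_sds_grad(voxels, predict_noise, alpha_bar, rng):
    """Stage 2: SDS-style gradient computed jointly over the stacked views."""
    views = render_views(voxels)
    eps = rng.standard_normal(views.shape)
    noisy = np.sqrt(alpha_bar) * views + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = predict_noise(noisy, alpha_bar)
    weight = 1.0 - alpha_bar  # assumed choice of the weighting w(t)
    return render_views_vjp(weight * (eps_hat - eps), voxels.shape[0])


rng = np.random.default_rng(0)

# Hypothetical data: a clean cube the toy prior prefers, plus a noisy coarse copy.
clean = np.zeros((N, N, N))
clean[2:6, 2:6, 2:6] = 1.0
clean_views = render_views(clean)
coarse_asset = clean + 0.3 * rng.standard_normal(clean.shape)
coarse_views = render_views(coarse_asset)


def predict_noise(noisy_views, alpha_bar):
    """Hypothetical multi-view model: the exact noise predictor when the data
    distribution is the single view set `clean_views`."""
    return (noisy_views - np.sqrt(alpha_bar) * clean_views) / np.sqrt(1.0 - alpha_bar)


def view_error(voxels):
    """Mean squared difference between rendered views and the clean views."""
    return float(np.mean((render_views(voxels) - clean_views) ** 2))


distilled = distill(coarse_views)
voxels = distilled.copy()
err_before = view_error(voxels)
for _ in range(200):
    alpha_bar = rng.uniform(0.2, 0.9)  # random noise level per step
    voxels -= 2.0 * multiview_sds_grad(voxels, predict_noise, alpha_bar, rng)
err_after = view_error(voxels)
print(f"view MSE before refinement: {err_before:.4f}, after: {err_after:.2e}")
```

Because the toy noise predictor is exact for a single target, the sampled noise cancels and each step simply pulls all three views toward the clean ones at once. With a real diffusion model the gradient is noisy and depends on the prompt, which is where the additional guidance described above would come in.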

The authors conduct extensive experiments on different differentiable 3D representations, reporting that BoostDream generates high-quality 3D assets quickly and overcomes the Janus (multi-face) problem seen in conventional SDS-based methods.

Critical Analysis

The paper presents a promising approach to improving the efficiency and quality of text-to-3D generation. However, the authors do not discuss potential limitations or areas for further research in depth. For example, it would be valuable to understand how BoostDream performs on a wider range of 3D asset types and how it compares to other state-of-the-art text-to-3D methods.

Additionally, the paper does not provide a detailed analysis of the computational complexity and runtime of BoostDream compared to other techniques. This information would be helpful for understanding the practical implications of deploying BoostDream in real-world applications.

Conclusion

The BoostDream framework represents a significant advancement in text-to-3D generation, combining the speed of feed-forward models with the high-quality results of SDS-based methods. By introducing 3D model distillation, a novel multi-view SDS loss, and prompt and normal map guidance, the authors have developed an efficient and effective technique for refining coarse 3D assets into high-fidelity representations. This work has the potential to enhance various applications, such as 3D content creation, virtual environments, and product design, by enabling more accessible and realistic 3D generation from text.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion

Yonghao Yu, Shunan Zhu, Huai Qin, Haorui Li

Witnessing the evolution of text-to-image diffusion models, significant strides have been made in text-to-3D generation. Currently, two primary paradigms dominate the field of text-to-3D: the feed-forward generation solutions, capable of swiftly producing 3D assets but often yielding coarse results, and the Score Distillation Sampling (SDS) based solutions, known for generating high-fidelity 3D assets albeit at a slower pace. The synergistic integration of these methods holds substantial promise for advancing 3D generation techniques. In this paper, we present BoostDream, a highly efficient plug-and-play 3D refining method designed to transform coarse 3D assets into high-quality ones. The BoostDream framework comprises three distinct processes: (1) We introduce 3D model distillation that fits differentiable representations from the 3D assets obtained through feed-forward generation. (2) A novel multi-view SDS loss is designed, which utilizes a multi-view aware 2D diffusion model to refine the 3D assets. (3) We propose to use prompt and multi-view consistent normal maps as guidance in refinement. Our extensive experiments are conducted on different differentiable 3D representations, revealing that BoostDream excels in generating high-quality 3D assets rapidly, overcoming the Janus problem compared to conventional SDS-based methods. This breakthrough signifies a substantial advancement in both the efficiency and quality of 3D generation processes.

9/18/2024


Retrieval-Augmented Score Distillation for Text-to-3D Generation

Junyoung Seo, Susung Hong, Wooseok Jang, Inès Hyeonsu Kim, Minseop Kwak, Doyup Lee, Seungryong Kim

Text-to-3D generation has achieved significant success by incorporating powerful 2D diffusion models, but insufficient 3D prior knowledge also leads to inconsistent 3D geometry. Recently, since large-scale multi-view datasets have been released, fine-tuning the diffusion model on these multi-view datasets has become a mainstream approach to solving the 3D inconsistency problem. However, it is confronted with fundamental difficulties regarding the limited quality and diversity of 3D data, compared with 2D data. To sidestep these trade-offs, we explore a retrieval-augmented approach tailored for score distillation, dubbed ReDream. We postulate that both the expressiveness of 2D diffusion models and the geometric consistency of 3D assets can be fully leveraged by employing semantically relevant assets directly within the optimization process. To this end, we introduce a novel framework for retrieval-based quality enhancement in text-to-3D generation. We leverage the retrieved asset to incorporate its geometric prior in the variational objective and adapt the diffusion model's 2D prior toward view consistency, achieving drastic improvements in both geometry and fidelity of generated scenes. We conduct extensive experiments to demonstrate that ReDream exhibits superior quality with increased geometric consistency. Project page is available at https://ku-cvlab.github.io/ReDream/.

5/3/2024


Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto

In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have been shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts, and may entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the necessity to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D assets from the same text prompt.

4/30/2024

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

Zixuan Chen, Ruijie Su, Jiahao Zhu, Lingxiao Yang, Jian-Huang Lai, Xiaohua Xie

Text-to-3D generation aims to create 3D assets from text-to-image diffusion models. However, existing methods face an inherent bottleneck in generation quality because the widely-used objectives such as Score Distillation Sampling (SDS) inappropriately omit U-Net Jacobians for swift generation, leading to significant bias compared to the true gradient obtained by full denoising sampling. This bias brings inconsistent updating direction, resulting in implausible 3D generation (e.g., color deviation, the Janus problem, and semantically inconsistent details). In this work, we propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel yet efficient objective for diffusion-based 3D generation tasks. Specifically, PCDS builds the pose-dependent consistency function within diffusion trajectories, making it possible to approximate true gradients through minimal sampling steps (1-3). Compared to SDS, PCDS can acquire a more accurate updating direction with the same sampling time (1 sampling step), while enabling few-step (2-3) sampling to trade compute for higher generation quality. For efficient generation, we propose a coarse-to-fine optimization strategy, which first utilizes 1-step PCDS to create the basic structure of 3D objects, and then gradually increases PCDS steps to generate fine-grained details. Extensive experiments demonstrate that our approach outperforms the state-of-the-art in generation quality and training efficiency, conspicuously alleviating the implausible 3D generation issues caused by the deviated updating direction. Moreover, it can be simply applied to many 3D generative applications to yield impressive 3D assets; please see our project page: https://narcissusex.github.io/VividDreamer.

6/24/2024