Generative AI meets 3D: A Survey on Text-to-3D in AIGC Era

2305.06131

Published 6/11/2024 by Chenghao Li, Chaoning Zhang, Atish Waghwase, Lik-Hang Lee, Francois Rameau, Yang Yang, Sung-Ho Bae, Choong Seon Hong

cs.CV

Abstract

Generative AI (AIGC, a.k.a. AI generated content) has made significant progress in recent years, with text-guided content generation being the most practical as it facilitates interaction between human instructions and AIGC. Due to advancements in text-to-image and 3D modeling technologies (like NeRF), text-to-3D has emerged as a nascent yet highly active research field. Our work conducts the first comprehensive survey and follows up on subsequent research progress in the overall field, aiming to help readers interested in this direction quickly catch up with its rapid development. First, we introduce 3D data representations, including both Euclidean and non-Euclidean data. Building on this foundation, we introduce various foundational technologies and summarize how recent work combines these foundational technologies to achieve satisfactory text-to-3D results. Additionally, we present mainstream baselines and research directions in recent text-to-3D technology, including fidelity, efficiency, consistency, controllability, diversity, and applicability. Furthermore, we summarize the usage of text-to-3D technology in various applications, including avatar generation, texture generation, shape editing, and scene generation.

Overview

Generative AI, also known as AI-generated content (AIGC), has made significant progress in recent years.
Text-guided content generation is the most practical application of AIGC, facilitating interaction between human instructions and the AI system.
Advancements in text-to-image and 3D modeling technologies have led to the emergence of text-to-3D as a nascent yet highly active research field.
This paper aims to provide a comprehensive survey of the text-to-3D research field and its subsequent progress.

Plain English Explanation

Generative AI, or AI-generated content, has become more advanced in recent years. The most useful application of this technology is allowing people to give the AI written instructions, which it then uses to create new content. With improvements in technology that can generate images and 3D models from text, there is a growing field of research focused on text-to-3D generation - creating 3D objects and scenes based on text descriptions.

This paper provides an overview of this emerging field of text-to-3D generation. It starts by explaining the different ways 3D data can be represented, covering both Euclidean and non-Euclidean data. Building on this foundation, the paper then introduces the foundational technologies and summarizes how recent work combines them to achieve satisfactory text-to-3D results.

The paper also covers the mainstream baselines and research directions in text-to-3D, including fidelity, efficiency, consistency, controllability, diversity, and applicability. Finally, it summarizes how text-to-3D technology is being used in applications such as avatar generation, texture generation, shape editing, and scene generation.

Technical Explanation

The paper first introduces the representations used for 3D data, covering both Euclidean data (regular structures such as voxel grids) and non-Euclidean data (irregular structures such as point clouds and meshes). It then provides an overview of the foundational technologies that underpin text-to-3D generation, such as text-to-image models and 3D modeling techniques like NeRF.
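To make the Euclidean versus non-Euclidean distinction concrete, the sketch below builds one shape in both forms. It is an illustration only and is not taken from the paper: the function names `sample_sphere_points` and `voxelize`, the unit sphere, and the 8-cell grid resolution are all assumptions chosen for brevity.

```python
import math
import random

def sample_sphere_points(n, radius=1.0, seed=0):
    """Non-Euclidean form: an unordered list of (x, y, z) surface points."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        # Normalizing a Gaussian vector gives a uniformly random direction.
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        points.append(tuple(radius * c / norm for c in v))
    return points

def voxelize(points, resolution=8, lo=-1.0, hi=1.0):
    """Euclidean form: a dense resolution^3 occupancy grid over [lo, hi]^3."""
    grid = [[[0] * resolution for _ in range(resolution)]
            for _ in range(resolution)]
    cell = (hi - lo) / resolution
    for x, y, z in points:
        # Map each coordinate to a cell index; clamp points on the upper face.
        i, j, k = (min(int((c - lo) / cell), resolution - 1) for c in (x, y, z))
        grid[i][j][k] = 1
    return grid

cloud = sample_sphere_points(500)
grid = voxelize(cloud)
occupied = sum(v for plane in grid for row in plane for v in row)
print(f"point cloud: {len(cloud)} points with no fixed ordering")
print(f"voxel grid: {len(grid)}^3 cells, {occupied} occupied")
```

The point cloud stores only surface samples, and shuffling the list does not change the shape it describes. The voxel grid has a fixed layout in which every cell has well-defined neighbors, at the cost of memory that grows cubically with resolution.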

Building on this foundation, the paper summarizes how recent research has combined these technologies to achieve satisfactory text-to-3D results. It organizes this work along six research directions: fidelity (how accurately the generated 3D content matches the input text), efficiency (the computational resources required), consistency (ensuring coherence between different generated elements), controllability (the ability to steer the generation process), diversity (generating a range of unique outputs), and applicability (the breadth of use cases).

The paper also covers mainstream baselines and research directions in the text-to-3D field, highlighting key technical advances and their implications.

Critical Analysis

The paper provides a comprehensive overview of the text-to-3D research field, covering both the foundational technologies and the latest progress. However, it acknowledges that text-to-3D generation remains a nascent and highly active area, with significant room for further advancement.

Some potential limitations and areas for future research mentioned in the paper include improving the fidelity and consistency of generated 3D content, enhancing the efficiency and controllability of the generation process, and expanding the diversity and applicability of the technology.

Additionally, the paper does not address potential ethical concerns or societal implications of widespread text-to-3D generation, which could be an important area for future research and discussion.

Conclusion

This paper provides a comprehensive overview of the emerging field of text-to-3D generation, covering the foundational technologies, recent research progress, and current challenges and opportunities.

By summarizing the mainstream baselines and the research directions of fidelity, efficiency, consistency, controllability, diversity, and applicability, the paper provides a valuable resource for researchers and practitioners interested in this rapidly evolving field of generative AI.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Survey On Text-to-3D Contents Generation In The Wild

Chenhan Jiang

3D content creation plays a vital role in various applications, such as gaming, robotics simulation, and virtual reality. However, the process is labor-intensive and time-consuming, requiring skilled designers to invest considerable effort in creating a single 3D asset. To address this challenge, text-to-3D generation technologies have emerged as a promising solution for automating 3D creation. Leveraging the success of large vision language models, these techniques aim to generate 3D content based on textual descriptions. Despite recent advancements in this area, existing solutions still face significant limitations in terms of generation quality and efficiency. In this survey, we conduct an in-depth investigation of the latest text-to-3D creation methods. We provide a comprehensive background on text-to-3D creation, including discussions on datasets employed in training and evaluation metrics used to assess the quality of generated 3D models. Then, we delve into the various 3D representations that serve as the foundation for the 3D generation process. Furthermore, we present a thorough comparison of the rapidly growing literature on generative pipelines, categorizing them into feedforward generators, optimization-based generation, and view reconstruction approaches. By examining the strengths and weaknesses of these methods, we aim to shed light on their respective capabilities and limitations. Lastly, we point out several promising avenues for future research. With this survey, we hope to inspire researchers further to explore the potential of open-vocabulary text-conditioned 3D content creation.

5/16/2024

cs.CV cs.GR

Instant3D: Instant Text-to-3D Generation

Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, Xiangyu Xu

Text-to-3D generation has attracted much attention from the computer vision community. Existing methods mainly optimize a neural field from scratch for each text prompt, relying on heavy and repetitive training cost which impedes their practical deployment. In this paper, we propose a novel framework for fast text-to-3D generation, dubbed Instant3D. Once trained, Instant3D is able to create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network. We achieve this remarkable speed by devising a new network that directly constructs a 3D triplane from a text prompt. The core innovation of our Instant3D lies in our exploration of strategies to effectively inject text conditions into the network. In particular, we propose to combine three key mechanisms: cross-attention, style injection, and token-to-plane transformation, which collectively ensure precise alignment of the output with the input text. Furthermore, we propose a simple yet effective activation function, the scaled-sigmoid, to replace the original sigmoid function, which speeds up the training convergence by more than ten times. Finally, to address the Janus (multi-head) problem in 3D generation, we propose an adaptive Perp-Neg algorithm that can dynamically adjust its concept negation scales according to the severity of the Janus problem during training, effectively reducing the multi-head effect. Extensive experiments on a wide variety of benchmark datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods both qualitatively and quantitatively, while achieving significantly better efficiency. The code, data, and models are available at https://github.com/ming1993li/Instant3DCodes.

4/30/2024

cs.CV cs.AI cs.GR cs.LG cs.MM

Idea-2-3D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs

Junhao Chen, Xiang Li, Xiaojun Ye, Chao Li, Zhaoxin Fan, Hao Zhao

In this paper, we pursue a novel 3D AIGC setting: generating 3D content from IDEAs. The definition of an IDEA is the composition of multimodal inputs including text, image, and 3D models. To our knowledge, this challenging and appealing 3D AIGC setting has not been studied before. We propose the novel framework called Idea-2-3D to achieve this goal, which consists of three agents based upon large multimodal models (LMMs) and several existing algorithmic tools for them to invoke. Specifically, these three LMM-based agents are prompted to do the jobs of prompt generation, model selection and feedback reflection. They work in a cycle that involves both mutual collaboration and criticism. Note that this cycle is done in a fully automatic manner, without any human intervention. The framework then outputs a text prompt to generate 3D models that well align with input IDEAs. We show impressive 3D AIGC results that are beyond what any previous methods can achieve. For quantitative comparisons, we construct caption-based baselines using a whole bunch of state-of-the-art 3D AIGC models and demonstrate that Idea-2-3D significantly outperforms them. In 94.2% of cases, Idea-2-3D meets users' requirements, marking a degree of match between IDEA and 3D models that is 2.3 times higher than baselines. Moreover, in 93.5% of the cases, users agreed that Idea-2-3D was better than baselines. Code, data, and models will be made publicly available.

4/9/2024

cs.CV

Interactive3D: Create What You Want by Interactive 3D Generation

Shaocong Dong, Lihe Ding, Zhanpeng Huang, Zibin Wang, Tianfan Xue, Dan Xu

3D object generation has undergone significant advancements, yielding high-quality results. However, existing methods fall short of achieving precise user control, often yielding results that do not align with user expectations, thus limiting their applicability. User-envisioning 3D object generation faces significant challenges in realizing its concepts using current generative models due to limited interaction capabilities. Existing methods mainly offer two approaches: (i) interpreting textual instructions with constrained controllability, or (ii) reconstructing 3D objects from 2D images. Both of them limit customization to the confines of the 2D reference and potentially introduce undesirable artifacts during the 3D lifting process, restricting the scope for direct and versatile 3D modifications. In this work, we introduce Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. Interactive3D is constructed in two cascading stages, utilizing distinct 3D representations. The first stage employs Gaussian Splatting for direct user interaction, allowing modifications and guidance of the generative direction at any intermediate step through (i) Adding and Removing components, (ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv) Semantic Editing. Subsequently, the Gaussian splats are transformed into InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to further add details and extract the geometry in the second stage. Our experiments demonstrate that Interactive3D markedly improves the controllability and quality of 3D generation. Our project webpage is available at https://interactive-3d.github.io/.

4/26/2024

cs.GR cs.CV