CityDreamer: Compositional Generative Model of Unbounded 3D Cities

Read original: arXiv:2309.00610 - Published 6/7/2024 by Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

Overview

3D city generation is a challenging task due to human sensitivity to structural distortions in urban environments and the wider range of building appearances compared to natural scenes.
To address these challenges, the researchers propose CityDreamer, a compositional generative model designed specifically for 3D city generation.
The key insight is that 3D city generation should be a composition of different types of neural fields: building instances and background stuff like roads and green spaces.

Plain English Explanation

The researchers developed a system called CityDreamer to generate realistic 3D cities. Generating 3D cities is more complex than generating natural 3D scenes because buildings can have a wide variety of appearances, while objects in nature tend to look more similar.

The researchers' approach involves breaking down the 3D city into two main components: the individual buildings and the background elements like roads and parks. They use specialized techniques to model each of these components, which allows the system to create more believable and diverse 3D cities.

The researchers also created a large dataset of real-world city imagery, called the CityGen Datasets, to help the system generate cities that look and feel more realistic.

Technical Explanation

The researchers propose CityDreamer, a compositional generative model for 3D city generation. The key insight is that 3D city generation should be a composition of different types of neural fields: 1) building instances and 2) background stuff, such as roads and green lands.
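As a rough illustration of this decomposition (a minimal sketch, not the authors' code), a top-down layout grid can be split into cells for a background field and cells grouped per building instance. The label ids, the `split_layout` helper, and the toy layout below are all hypothetical and chosen only for this example.

```python
# Hypothetical label ids for a top-down layout grid (not from the paper).
ROAD, GREEN, BUILDING = 0, 1, 2


def split_layout(semantic, instance):
    """Split a layout into background cells and per-building cell lists.

    semantic: 2D list of class ids.
    instance: 2D list of building ids (0 where there is no building).
    Both grids are illustrative inputs, not the CityGen data format.
    """
    background = []   # cells that a background ("stuff") field would handle
    buildings = {}    # building id -> cells that an instance field would handle
    for y, row in enumerate(semantic):
        for x, label in enumerate(row):
            if label == BUILDING:
                buildings.setdefault(instance[y][x], []).append((x, y))
            else:
                background.append((x, y))
    return background, buildings


semantic = [
    [ROAD, ROAD, ROAD],
    [GREEN, BUILDING, BUILDING],
    [GREEN, BUILDING, ROAD],
]
instance = [
    [0, 0, 0],
    [0, 1, 2],
    [0, 1, 0],
]
background, buildings = split_layout(semantic, instance)
```

Keeping each building under its own id means each one can be generated on its own, which fits the localized editing the abstract mentions.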

Specifically, the system uses a bird's eye view scene representation and employs a volumetric rendering approach for both the instance-oriented and stuff-oriented neural fields. The researchers tailor the generative hash grid and periodic positional embedding techniques to suit the distinct characteristics of building instances and background stuff.
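To make two of these terms concrete, the sketch below shows the standard volume rendering quadrature used by NeRF-style methods, plus a generic sinusoidal encoding of a coordinate. Both are textbook forms offered as an assumption about the general technique. The summary does not give CityDreamer's exact formulation, and the function names `sinusoidal_embedding` and `render_ray` are hypothetical.

```python
import math


def sinusoidal_embedding(p, num_freqs=4):
    """Generic periodic encoding of a scalar: [sin(2^k p), cos(2^k p)] per k."""
    feats = []
    for k in range(num_freqs):
        feats.append(math.sin((2 ** k) * p))
        feats.append(math.cos((2 ** k) * p))
    return feats


def render_ray(densities, colors, deltas):
    """Composite samples along one ray with the usual quadrature rule.

    densities: non-negative density per sample
    colors:    (r, g, b) tuple per sample
    deltas:    distance covered by each sample
    Returns the pixel color and the remaining transmittance.
    """
    transmittance = 1.0          # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution of this sample
        for c in range(3):
            pixel[c] += weight * color[c]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# An empty sample followed by a very dense green sample.
pixel, remaining = render_ray(
    densities=[0.0, 1e9],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[1.0, 1.0],
)
```

In this toy ray the first sample has zero density and contributes nothing, so the pixel takes the color of the dense sample behind it.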

Additionally, the researchers contribute the CityGen Datasets, a suite that includes the OSM and GoogleEarth datasets and comprises a vast amount of real-world city imagery. These datasets help the system generate 3D cities that are more realistic in both layout and appearance.

Critical Analysis

The researchers acknowledge that generating realistic 3D cities is a challenging task, as humans are highly sensitive to structural distortions in urban environments. They also note that 3D city generation is more complex than 3D natural scene generation due to the wider range of building appearances.

While the CityDreamer model and the CityGen Datasets represent significant advancements in the field, the researchers do not discuss potential limitations or areas for further research in detail. For example, it would be interesting to explore how the system might handle the generation of cities with unique architectural styles or cultural influences.

Additionally, the researchers could have compared their approach to other recent developments in 3D scene and city generation, such as RealmDreamer, DreamScene, or StyleCity, to provide a more comprehensive understanding of the state of the art in this field.

Conclusion

The researchers have developed CityDreamer, a compositional generative model that addresses the challenges of 3D city generation. By breaking down the task into building instances and background stuff, the system is able to generate more realistic and diverse 3D cities.

The CityGen Datasets, which include a vast amount of real-world city imagery, are also a valuable contribution that can help advance the field of 3D city generation. While the researchers have made significant progress, there are still opportunities for further exploration and improvement, such as generating cities with unique architectural styles or cultural influences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

3D city generation is a desirable yet challenging task, since humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than 3D natural scenes since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities. Our key insight is that 3D city generation should be a composition of different types of neural fields: 1) various building instances, and 2) background stuff, such as roads and green lands. Specifically, we adopt the bird's eye view scene representation and employ a volumetric render for both instance-oriented and stuff-oriented neural fields. The generative hash grid and periodic positional embedding are tailored as scene parameterization to suit the distinct characteristics of building instances and background stuff. Furthermore, we contribute a suite of CityGen Datasets, including OSM and GoogleEarth, which comprises a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. CityDreamer achieves state-of-the-art performance not only in generating realistic 3D cities but also in localized editing within the generated cities.

6/7/2024

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, Bernhard Schölkopf

As pretrained text-to-image diffusion models become increasingly powerful, recent efforts have been made to distill knowledge from these text-to-image pretrained models for optimizing a text-guided 3D model. Most of the existing methods generate a holistic 3D model from a plain text input. This can be problematic when the text describes a complex scene with multiple objects, because the vectorized text embeddings are inherently unable to capture a complex description with multiple entities and relationships. Holistic 3D modeling of the entire scene further prevents accurate grounding of text entities and concepts. To address this limitation, we propose GraphDreamer, a novel framework to generate compositional 3D scenes from scene graphs, where objects are represented as nodes and their interactions as edges. By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model and is able to fully disentangle different objects without image-level supervision. To facilitate modeling of object-wise relationships, we use signed distance fields as representation and impose a constraint to avoid inter-penetration of objects. To avoid manual scene graph creation, we design a text prompt for ChatGPT to generate scene graphs based on text inputs. We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer in generating high-fidelity compositional 3D scenes with disentangled object entities.

6/12/2024

Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

Fan Lu, Kwan-Yee Lin, Yan Xu, Hongsheng Li, Guang Chen, Changjun Jiang

Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling up the methodology to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, present a formidable barrier to the interpretability of ambiguous textual descriptions for effective model optimization. In this work, we surmount the limitations by introducing a compositional 3D layout representation into text-to-3D paradigm, serving as an additional prior. It comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships, complementing textual descriptions and enabling steerable generation. Upon this, we propose two modifications -- (1) We introduce Layout-Guided Variational Score Distillation to address model optimization inadequacies. It conditions the score distillation sampling process with geometric and semantic constraints of 3D layouts. (2) To handle the unbounded nature of urban scenes, we represent 3D scene with a Scalable Hash Grid structure, incrementally adapting to the growing scale of urban scenes. Extensive experiments substantiate the capability of our framework to scale text-to-3D generation to large-scale urban scenes that cover over 1000m driving distance for the first time. We also present various scene editing demonstrations, showing the powers of steerable urban scene generation. Website: https://urbanarchitect.github.io.

4/11/2024

CityCraft: A Real Crafter for 3D City Generation

Jie Deng, Wenhao Chai, Junsheng Huang, Zhonghan Zhao, Qixuan Huang, Mingyan Gao, Jianshu Guo, Shengyu Hao, Wenhao Hu, Jenq-Neng Hwang, Xi Li, Gaoang Wang

City scene generation has gained significant attention in autonomous driving, smart city development, and traffic simulation. It helps enhance infrastructure planning and monitoring solutions. Existing methods have employed a two-stage process involving city layout generation, typically using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Transformers, followed by neural rendering. These techniques often exhibit limited diversity and noticeable artifacts in the rendered city scenes. The rendered scenes lack variety, resembling the training images, resulting in monotonous styles. Additionally, these methods lack planning capabilities, leading to less realistic generated scenes. In this paper, we introduce CityCraft, an innovative framework designed to enhance both the diversity and quality of urban scene generation. Our approach integrates three key stages: initially, a diffusion transformer (DiT) model is deployed to generate diverse and controllable 2D city layouts. Subsequently, a Large Language Model (LLM) is utilized to strategically make land-use plans within these layouts based on user prompts and language guidelines. Based on the generated layout and city plan, we utilize the asset retrieval module and Blender for precise asset placement and scene construction. Furthermore, we contribute two new datasets to the field: 1) CityCraft-OSM dataset including 2D semantic layouts of urban areas, corresponding satellite images, and detailed annotations. 2) CityCraft-Buildings dataset, featuring thousands of diverse, high-quality 3D building assets. CityCraft achieves state-of-the-art performance in generating realistic 3D cities.

6/10/2024