StyleCity: Large-Scale 3D Urban Scenes Stylization with Vision-and-Text Reference via Progressive Optimization

Read original: arXiv:2404.10681 - Published 7/17/2024 by Yingshu Chen, Huajian Huang, Tuan-Anh Vu, Ka Chun Shum, Sai-Kit Yeung

Overview

This research paper introduces StyleCity, a vision-and-text-driven texture stylization system for large-scale 3D urban scenes.
Given a reference image and a text prompt, the method restyles a 3D textured mesh of a city in a semantics-aware fashion and generates a matching omnidirectional sky background.
The system progressively optimizes a neural texture field over training views at different scales, so the scene takes on the target style while its content is preserved.

Plain English Explanation

The paper introduces a technique called "StyleCity" that can restyle 3D models of entire city scenes. For example, you could take a realistic textured 3D model of a city and give the same city a different look, taken from a reference image and a short text description.

The key innovation is that StyleCity uses both a visual reference (an example image) and a text description to guide the stylization process. You provide an image that represents the style you want, together with text describing the desired look and feel, and the system automatically adjusts the city model's textures to match.

The system works by progressively optimizing the textures of the 3D scene: it trains on views of the city at several scales and rescales the style image to match each one. This lets it preserve the content of the original city model while applying the new style. It also generates a sky background in the same style, so the restyled city sits in a consistent atmosphere.

Technical Explanation

The StyleCity system takes a 3D textured mesh of a large-scale urban scene as input and uses a reference image and a text prompt to restyle it. The image supplies the target visual style, and the text further specifies the target look. Priors from both are transferred from 2D to the 3D scene, globally and locally.
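One simple way to picture guidance from two references at once is a weighted sum of two distances, one to the reference image and one to the text prompt. The sketch below is a minimal illustration, not the loss used in the paper: it assumes that a rendered view, the style image, and the text prompt have already been mapped to embedding vectors by some shared image-text model, and the names `combined_style_loss`, `w_image`, and `w_text` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def combined_style_loss(render_emb, image_emb, text_emb, w_image=1.0, w_text=1.0):
    """Weighted sum of the distances to the image and text references.

    All three inputs are assumed to be embeddings from the same
    (hypothetical) image-text model; the weights are illustrative knobs.
    """
    image_term = cosine_distance(render_emb, image_emb)
    text_term = cosine_distance(render_emb, text_emb)
    return w_image * image_term + w_text * text_term


# Toy usage: the render matches the image reference but is orthogonal
# to the text reference, so only the text term contributes.
loss = combined_style_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

Raising `w_text` relative to `w_image` would pull the result toward the text description; how StyleCity actually balances the two references is not specified in this summary.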

The system's architecture consists of several key components:

Neural Texture Field Stylization: StyleCity stylizes a neural texture field for the input textured mesh, transferring 2D vision-and-text priors to the 3D scene both globally and locally.
Progressive Multi-Scale Optimization: Training views of the scene are planned and scaled progressively at different levels to preserve high-quality scene content, and the style image is rescaled to match the scale of the training views.
Semantics-Aware Style Loss: A style loss that respects scene semantics keeps local regions consistent, which the authors describe as crucial for photo-realistic stylization.
Sky Synthesis: A generative diffusion model synthesizes a style-consistent omnidirectional sky image, which adds atmosphere and assists the semantic stylization process.
Texture Baking: The stylized neural texture field can be baked into a texture of arbitrary resolution, so the result fits into conventional rendering pipelines.
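The progressive optimization can be pictured as a multi-level schedule. The paper's abstract describes scaling the training views at different levels and adapting the scale of the style image to the scale of the views. The sketch below is an illustration under stated assumptions, not the authors' schedule: the function name, the halving `zoom_factor`, and the inverse rule linking style-image size to view scale are all hypothetical.

```python
def progressive_schedule(num_levels, start_view_scale=1.0, zoom_factor=0.5,
                         base_style_size=256):
    """Return one (view_scale, style_size) pair per optimization level.

    view_scale: relative extent of the scene covered by a training view
                (1.0 = widest view); it shrinks by zoom_factor per level.
    style_size: side length the style image is resized to at that level.

    Assumed rule (illustrative only): when a view covers less of the
    scene, scene elements appear larger, so the style image is enlarged
    by the same ratio to keep style patterns in proportion.
    """
    schedule = []
    for level in range(num_levels):
        view_scale = start_view_scale * (zoom_factor ** level)
        style_size = round(base_style_size / view_scale)
        schedule.append((view_scale, style_size))
    return schedule


# Three levels, from wide views to close-up views.
levels = progressive_schedule(3)
# levels == [(1.0, 256), (0.5, 512), (0.25, 1024)]
```

An optimizer would then loop over `levels`, rendering training views at each `view_scale` and comparing them against the style image resized to `style_size`.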

The researchers report extensive experiments in which StyleCity's stylized scenes come out ahead in qualitative comparisons, quantitative metrics, and user preferences.

Critical Analysis

The StyleCity paper presents a compelling approach for transforming large-scale 3D urban environments into various artistic styles. The key strength of the method is its ability to leverage both visual and textual information to guide the stylization process, allowing for more nuanced and expressive results compared to previous techniques that relied on a single modality.

However, some limitations are worth keeping in mind. The optimization process can be computationally intensive, especially for very large and complex city models. Additionally, because the method stylizes textures, it may struggle to reproduce looks that depend on rendering techniques, geometry changes, or material properties that a texture alone cannot capture.

Further research could explore ways to improve the efficiency and generalization capabilities of the StyleCity approach, such as investigating more advanced optimization algorithms or incorporating additional modalities (e.g., 3D style exemplars) to enrich the stylization process. Exploring the application of StyleCity to other types of 3D scenes beyond urban environments could also be a fruitful area for future work.

Overall, the StyleCity paper presents a promising step forward in the field of large-scale 3D scene stylization, demonstrating the potential of combining visual and textual information to enable more expressive and user-friendly 3D content creation tools.

Conclusion

The StyleCity paper introduces an approach for restyling the textures of large-scale 3D urban scenes by leveraging both visual and textual reference information. Its progressive optimization of a neural texture field matches the target style while preserving scene content, and the result can be baked into a standard texture for use in conventional rendering pipelines.

The research demonstrates the potential of combining multiple modalities to enable more expressive and user-friendly 3D content creation tools, which could have applications in fields such as architecture, urban planning, and digital art. While the current system has some limitations, the promising results and the paper's discussion of future research directions suggest that further advancements in this area are likely to have a meaningful impact on how we create and interact with 3D virtual environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

StyleCity: Large-Scale 3D Urban Scenes Stylization with Vision-and-Text Reference via Progressive Optimization

Yingshu Chen, Huajian Huang, Tuan-Anh Vu, Ka Chun Shum, Sai-Kit Yeung

Creating large-scale virtual urban scenes with variant styles is inherently challenging. To facilitate prototypes of virtual production and bypass the need for complex materials and lighting setups, we introduce the first vision-and-text-driven texture stylization system for large-scale urban scenes, StyleCity. Taking an image and text as references, StyleCity stylizes a 3D textured mesh of a large-scale urban scene in a semantics-aware fashion and generates a harmonic omnidirectional sky background. To achieve that, we propose to stylize a neural texture field by transferring 2D vision-and-text priors to 3D globally and locally. During 3D stylization, we progressively scale the planned training views of the input 3D scene at different levels in order to preserve high-quality scene content. We then optimize the scene style globally by adapting the scale of the style image with the scale of the training views. Moreover, we enhance local semantics consistency by the semantics-aware style loss which is crucial for photo-realistic stylization. Besides texture stylization, we further adopt a generative diffusion model to synthesize a style-consistent omnidirectional sky image, which offers a more immersive atmosphere and assists the semantic stylization process. The stylized neural texture field can be baked into an arbitrary-resolution texture, enabling seamless integration into conventional rendering pipelines and significantly easing the virtual production prototyping process. Extensive experiments demonstrate our stylized scenes' superiority in qualitative and quantitative performance and user preferences.

7/17/2024

Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

Fan Lu, Kwan-Yee Lin, Yan Xu, Hongsheng Li, Guang Chen, Changjun Jiang

Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling up the methodology to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, present a formidable barrier to the interpretability of ambiguous textual descriptions for effective model optimization. In this work, we surmount the limitations by introducing a compositional 3D layout representation into text-to-3D paradigm, serving as an additional prior. It comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships, complementing textual descriptions and enabling steerable generation. Upon this, we propose two modifications -- (1) We introduce Layout-Guided Variational Score Distillation to address model optimization inadequacies. It conditions the score distillation sampling process with geometric and semantic constraints of 3D layouts. (2) To handle the unbounded nature of urban scenes, we represent 3D scene with a Scalable Hash Grid structure, incrementally adapting to the growing scale of urban scenes. Extensive experiments substantiate the capability of our framework to scale text-to-3D generation to large-scale urban scenes that cover over 1000m driving distance for the first time. We also present various scene editing demonstrations, showing the powers of steerable urban scene generation. Website: https://urbanarchitect.github.io.

4/11/2024

Dream-in-Style: Text-to-3D Generation using Stylized Score Distillation

Hubert Kompanowski, Binh-Son Hua

We present a method to generate 3D objects in styles. Our method takes a text prompt and a style reference image as input and reconstructs a neural radiance field to synthesize a 3D model with the content aligning with the text prompt and the style following the reference image. To simultaneously generate the 3D object and perform style transfer in one go, we propose a stylized score distillation loss to guide a text-to-3D optimization process to output visually plausible geometry and appearance. Our stylized score distillation is based on a combination of an original pretrained text-to-image model and its modified sibling with the key and value features of self-attention layers manipulated to inject styles from the reference image. Comparisons with state-of-the-art methods demonstrated the strong visual performance of our method, further supported by the quantitative results from our user study.

6/28/2024

Visual Text Generation in the Wild

Yuanzhi Zhu, Jiawei Liu, Feiyu Gao, Wenyu Liu, Xinggang Wang, Peng Wang, Fei Huang, Cong Yao, Zhibo Yang

Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images can facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which are used by a conditional diffusion model as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Besides, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.

7/22/2024