TexPainter: Generative Mesh Texturing with Multi-view Consistency

2406.18539

Published 6/28/2024 by Hongkun Zhang, Zherong Pan, Congyi Zhang, Lifeng Zhu, Xifeng Gao

TexPainter: Generative Mesh Texturing with Multi-view Consistency

Abstract

The recent success of pre-trained diffusion models unlocks the possibility of the automatic generation of textures for arbitrary 3D meshes in the wild. However, these models are trained in the screen space, while converting them to a multi-view consistent texture image poses a major obstacle to the output quality. In this paper, we propose a novel method to enforce multi-view consistency. Our method is based on the observation that latent space in a pre-trained diffusion model is noised separately for each camera view, making it difficult to achieve multi-view consistency by directly manipulating the latent codes. Based on the celebrated Denoising Diffusion Implicit Models (DDIM) scheme, we propose to use an optimization-based color-fusion to enforce consistency and indirectly modify the latent codes by gradient back-propagation. Our method further relaxes the sequential dependency assumption among the camera views. By evaluating on a series of general 3D models, we find our simple approach improves consistency and overall quality of the generated textures as compared to competing state-of-the-arts. Our implementation is available at: https://github.com/Quantuman134/TexPainter

Create account to get full access

Related Work

Text-Guided Texture Generation

Prior work has explored using text to guide the generation of 2D textures. For example, Text-Driven Diverse Facial Texture Generation via Latent Diffusion used a diffusion model to generate diverse facial textures from text prompts. Similarly, Single Mesh Diffusion Models: From Field Latents to Textured 3D Meshes generated textured 3D meshes by conditioning a diffusion model on text.

3D Texture Synthesis

Researchers have also developed techniques for synthesizing textures directly on 3D mesh surfaces. RoomTex: Texturing Compositional Indoor Scenes via Iterative Layout-to-Texture Generation used a generative model to paint textures on 3D scene layouts. Consistent and Fast 3D Painting via Latent Consistency generated consistent textures across multiple views of a 3D mesh.

Diffusion Models for 3D

Recent work has explored using diffusion models to generate 3D content. Grounded Compositional Diverse Text-to-3D Pretrained used a diffusion model to generate 3D shapes from text prompts. The current paper, TexPainter, builds on these techniques to generate textured 3D meshes in a multi-view consistent manner.

Plain English Explanation

The paper TexPainter explores a new way to automatically generate textured 3D models from text descriptions. Previous work has shown how text can be used to guide the generation of 2D textures or to synthesize textures directly on 3D mesh surfaces. TexPainter combines these ideas, using a diffusion model to generate 3D meshes with consistent textures across multiple views.

A diffusion model is a type of machine learning model that can generate new data by learning from examples. In the case of TexPainter, the model learns to generate textured 3D meshes by analyzing a large dataset of 3D models with associated text descriptions. Once trained, the model can then generate new 3D meshes with textures that match a given text prompt.

The key innovation of TexPainter is its ability to generate textures that are consistent across multiple views of the 3D mesh. This is important for creating realistic 3D models that can be viewed from different angles. TexPainter achieves this by conditioning the diffusion model on information about the 3D geometry and using techniques to enforce consistency during the generation process.

Overall, TexPainter represents an exciting step forward in the field of text-guided 3D content generation. By making it easier to create high-quality, multi-view consistent textured 3D models, it could have applications in areas like virtual reality, gaming, and design.

Technical Explanation

TexPainter is a generative model for creating textured 3D meshes from text descriptions. The system uses a diffusion model architecture, which is a type of generative model that learns to generate new data by gradually adding noise to a latent representation and then learning to reverse the process.

The key innovation of TexPainter is its ability to generate textures that are consistent across multiple views of the 3D mesh. To achieve this, the model takes as input not only the text prompt but also information about the 3D geometry, such as normal vectors and depth maps. The model then learns to generate a latent representation of the textured mesh that is consistent with this multi-view geometric information.

During training, the model is exposed to a dataset of 3D meshes with associated text descriptions. The model learns to map from the text prompt and geometric features to a consistent latent representation of the textured mesh. At inference time, the model can then generate new textured meshes by sampling from this learned latent space.

The authors evaluate TexPainter on a challenging dataset of 3D meshes and show that it outperforms previous approaches in terms of both texture quality and view consistency. They also demonstrate that the model can be used to generate diverse and creative textured 3D content from a wide range of text prompts.

Critical Analysis

One key strength of TexPainter is its ability to generate textured 3D meshes that maintain consistency across multiple views. This is an important capability for creating realistic 3D content that can be viewed from different angles. The authors' use of multi-view geometric information to condition the diffusion model is a clever and effective approach to achieving this.

However, the paper does not address some potential limitations and areas for further research. For example, the dataset used for training is relatively small and may not capture the full diversity of 3D textured meshes that exist in the real world. Additionally, the paper does not explore the model's robustness to challenging geometric features, such as thin or complex structures, which could be an important consideration for real-world applications.

Another area for potential improvement is the model's ability to generate consistent textures at different scales. While the paper demonstrates good results for generating consistent textures within a single view, it is not clear how well the model would perform when scaling the textured mesh or generating textures at different resolutions.

Overall, TexPainter represents an exciting advance in the field of text-guided 3D content generation. By combining diffusion models with multi-view geometric information, the authors have created a system that can generate high-quality, consistent textured 3D meshes. However, further research is needed to fully explore the capabilities and limitations of this approach, particularly in the context of real-world applications.

Conclusion

TexPainter is a novel generative model that can create textured 3D meshes from text descriptions, with the key innovation being its ability to generate consistent textures across multiple views of the 3D model. By conditioning the diffusion-based generation process on multi-view geometric information, the system is able to produce realistic and coherent textured 3D content.

This work represents an important step forward in the field of text-guided 3D content generation, with potential applications in areas like virtual reality, gaming, and design. While the paper highlights some impressive results, it also identifies opportunities for further research, such as exploring the model's robustness to complex geometric features and its ability to generate consistent textures at different scales.

Overall, TexPainter demonstrates the power of combining diffusion models with multi-view geometric information to tackle the challenge of generating high-quality, consistent textured 3D meshes from text prompts. As the field of 3D content generation continues to evolve, this research could serve as a valuable foundation for developing even more advanced and versatile text-guided 3D modeling tools.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Consistency^2: Consistent and Fast 3D Painting with Latent Consistency Models

Tianfu Wang, Anton Obukhov, Konrad Schindler

Generative 3D Painting is among the top productivity boosters in high-resolution 3D asset management and recycling. Ever since text-to-image models became accessible for inference on consumer hardware, the performance of 3D Painting methods has consistently improved and is currently close to plateauing. At the core of most such models lies denoising diffusion in the latent space, an inherently time-consuming iterative process. Multiple techniques have been developed recently to accelerate generation and reduce sampling iterations by orders of magnitude. Designed for 2D generative imaging, these techniques do not come with recipes for lifting them into 3D. In this paper, we address this shortcoming by proposing a Latent Consistency Model (LCM) adaptation for the task at hand. We analyze the strengths and weaknesses of the proposed model and evaluate it quantitatively and qualitatively. Based on the Objaverse dataset samples study, our 3D painting method attains strong preference in all evaluations. Source code is available at https://github.com/kongdai123/consistency2.

6/18/2024

cs.CV cs.GR

Text-Driven Diverse Facial Texture Generation via Progressive Latent-Space Refinement

Chi Wang, Junming Huang, Rong Zhang, Qi Wang, Haotian Yang, Haibin Huang, Chongyang Ma, Weiwei Xu

Automatic 3D facial texture generation has gained significant interest recently. Existing approaches may not support the traditional physically based rendering pipeline or rely on 3D data captured by Light Stage. Our key contribution is a progressive latent space refinement approach that can bootstrap from 3D Morphable Models (3DMMs)-based texture maps generated from facial images to generate high-quality and diverse PBR textures, including albedo, normal, and roughness. It starts with enhancing Generative Adversarial Networks (GANs) for text-guided and diverse texture generation. To this end, we design a self-supervised paradigm to overcome the reliance on ground truth 3D textures and train the generative model with only entangled texture maps. Besides, we foster mutual enhancement between GANs and Score Distillation Sampling (SDS). SDS boosts GANs with more generative modes, while GANs promote more efficient optimization of SDS. Furthermore, we introduce an edge-aware SDS for multi-view consistent facial structure. Experiments demonstrate that our method outperforms existing 3D texture generation methods regarding photo-realistic quality, diversity, and efficiency.

4/16/2024

cs.CV

Single Mesh Diffusion Models with Field Latents for Texture Generation

Thomas W. Mitchel, Carlos Esteves, Ameesh Makadia

We introduce a framework for intrinsic latent diffusion models operating directly on the surfaces of 3D shapes, with the goal of synthesizing high-quality textures. Our approach is underpinned by two contributions: field latents, a latent representation encoding textures as discrete vector fields on the mesh vertices, and field latent diffusion models, which learn to denoise a diffusion process in the learned latent space on the surface. We consider a single-textured-mesh paradigm, where our models are trained to generate variations of a given texture on a mesh. We show the synthesized textures are of superior fidelity compared those from existing single-textured-mesh generative models. Our models can also be adapted for user-controlled editing tasks such as inpainting and label-guided generation. The efficacy of our approach is due in part to the equivariance of our proposed framework under isometries, allowing our models to seamlessly reproduce details across locally similar regions and opening the door to a notion of generative texture transfer.

5/30/2024

cs.CV cs.GR cs.LG

RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting

Qi Wang, Ruijie Lu, Xudong Xu, Jingbo Wang, Michael Yu Wang, Bo Dai, Gang Zeng, Dan Xu

The advancement of diffusion models has pushed the boundary of text-to-3D object generation. While it is straightforward to composite objects into a scene with reasonable geometry, it is nontrivial to texture such a scene perfectly due to style inconsistency and occlusions between objects. To tackle these problems, we propose a coarse-to-fine 3D scene texturing framework, referred to as RoomTex, to generate high-fidelity and style-consistent textures for untextured compositional scene meshes. In the coarse stage, RoomTex first unwraps the scene mesh to a panoramic depth map and leverages ControlNet to generate a room panorama, which is regarded as the coarse reference to ensure the global texture consistency. In the fine stage, based on the panoramic image and perspective depth maps, RoomTex will refine and texture every single object in the room iteratively along a series of selected camera views, until this object is completely painted. Moreover, we propose to maintain superior alignment between RGB and depth spaces via subtle edge detection methods. Extensive experiments show our method is capable of generating high-quality and diverse room textures, and more importantly, supporting interactive fine-grained texture control and flexible scene editing thanks to our inpainting-based framework and compositional mesh input. Our project page is available at https://qwang666.github.io/RoomTex/.

6/5/2024

cs.CV