GenesisTex2: Stable, Consistent and High-Quality Text-to-Texture Generation

Read original: arXiv:2409.18401 - Published 9/30/2024 by Jiawei Lu, Yingpeng Zhang, Zengjun Zhao, He Wang, Kun Zhou, Tianjia Shao

Overview

Large-scale text-guided image diffusion models have shown impressive results in text-to-image (T2I) generation.
Applying these models to synthesize textures for 3D geometries remains challenging due to the domain gap between 2D images and 3D surface textures.
Early works using a projecting-and-inpainting approach preserved generation diversity but resulted in noticeable artifacts and style inconsistencies.
Recent methods have attempted to address these inconsistencies, but often introduce other issues like blurring, over-saturation, or over-smoothing.

Plain English Explanation

The paper proposes a novel text-to-texture synthesis framework that builds upon pretrained diffusion models. The key ideas are:

Local Attention Reweighing: A mechanism is introduced to guide the model in focusing on spatially-correlated patches across different views, enhancing local details while preserving cross-view consistency.
Latent Space Merge Pipeline: A novel approach that ensures consistency across different viewpoints without sacrificing too much diversity.
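
To make the first idea concrete, here is a minimal NumPy sketch of attention that is biased toward spatially close patches. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian bias on 3D distance, the `positions` input, and the `sigma` bandwidth are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reweighted_attention(q, k, v, positions, sigma=0.1):
    """Self-attention over patches pooled from all views.

    q, k, v:    (n_patches, dim) query, key, and value arrays
    positions:  (n_patches, 3) surface point seen by each patch
                (hypothetical input, assumed known from rendering)
    sigma:      hypothetical bandwidth of the spatial bias
    """
    dim = q.shape[-1]
    logits = q @ k.T / np.sqrt(dim)  # standard scaled dot-product scores

    # Pairwise squared 3D distances between the patches' surface points.
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)

    # Assumed reweighting: a Gaussian falloff added in log space, so
    # patches that are far apart on the surface receive less attention.
    bias = -dist2 / (2.0 * sigma ** 2)

    weights = softmax(logits + bias, axis=-1)
    return weights @ v, weights
```

With this bias, a patch attends mostly to patches from any view that look at nearby surface points, which is one plausible way to read "spatially-correlated patches across different views".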

This framework significantly outperforms existing state-of-the-art techniques in terms of texture consistency and visual quality, while delivering results much faster than distillation-based methods. Importantly, the framework does not require additional training or fine-tuning, making it highly adaptable to a wide range of publicly available models.

Technical Explanation

The paper proposes a novel text-to-texture synthesis framework that leverages pretrained diffusion models. The key technical contributions are:

Local Attention Reweighing: The authors introduce a mechanism to guide the model's self-attention layers in concentrating on spatially-correlated patches across different views. This enhances the preservation of local details while maintaining cross-view consistency.
Latent Space Merge Pipeline: The authors propose a novel approach to ensure consistency across different viewpoints without sacrificing too much diversity. This pipeline merges the latent representations from multiple viewpoints to produce the final texture.
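
The merging step can be illustrated with a short NumPy sketch. This is a simplified guess at the general idea, not the paper's pipeline: it assumes each latent pixel has a precomputed texel index (`texel_ids`) and a blending weight (`weights`), both hypothetical inputs, and it merges by weighted averaging in texture space before writing the shared value back to every view.

```python
import numpy as np

def merge_latents(view_latents, texel_ids, weights, n_texels):
    """Average per-view latents in a shared texture space.

    view_latents: (n_views, n_pixels, channels) latent values per view
    texel_ids:    (n_views, n_pixels) int texel index hit by each latent
                  pixel, or -1 for background (hypothetical input)
    weights:      (n_views, n_pixels) non-negative blending weights,
                  e.g. a view-angle cosine (an assumption)
    n_texels:     number of texels in the shared texture
    """
    channels = view_latents.shape[-1]

    # Only pixels that hit the surface with a positive weight contribute.
    valid = (texel_ids >= 0) & (weights > 0)
    ids = texel_ids[valid]        # (m,)
    w = weights[valid]            # (m,)
    lat = view_latents[valid]     # (m, channels)

    # Accumulate weighted sums per texel; np.add.at handles repeated ids.
    tex_sum = np.zeros((n_texels, channels))
    w_sum = np.zeros(n_texels)
    np.add.at(tex_sum, ids, lat * w[:, None])
    np.add.at(w_sum, ids, w)

    # Normalize covered texels; uncovered texels stay at zero.
    covered = w_sum > 0
    texture = np.zeros_like(tex_sum)
    texture[covered] = tex_sum[covered] / w_sum[covered, None]

    # Write the merged value back so all views agree where they overlap.
    merged = view_latents.astype(float)
    merged[valid] = texture[ids]
    return texture, merged
```

After the call, any two latent pixels that map to the same texel hold the same value, while background pixels are left untouched. How strongly and at which denoising steps such a merge is applied is what would control the trade-off between consistency and diversity; the summary does not specify those details.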

According to the authors, the proposed framework significantly outperforms existing state-of-the-art techniques in terms of texture consistency and visual quality. It also delivers results much faster than distillation-based methods.

Critical Analysis

The paper addresses an important challenge in applying text-guided image diffusion models to 3D texturing, namely the domain gap between 2D images and 3D surface textures. The proposed framework offers a practical, training-free solution that the authors report outperforms existing methods in both quality and efficiency.

One open question is how well the framework itself avoids the blurring, over-saturation, and over-smoothing that earlier consistency-focused methods introduced. The paper positions its method as a response to these problems, so it would be valuable to investigate how robustly the framework handles such artifacts.

Additionally, the paper does not provide a comprehensive evaluation of the framework's performance across a wide range of 3D geometries and texture types. Further research could explore the generalizability of the approach and its ability to handle more complex or diverse texturing scenarios.

Conclusion

The proposed text-to-texture synthesis framework leverages pretrained diffusion models to address the challenges of applying large-scale text-guided image generation to 3D surface texturing. Its two key innovations, local attention reweighing and a latent space merge pipeline, are reported to significantly outperform existing state-of-the-art techniques in texture consistency and visual quality, while delivering results much faster than distillation-based methods.

This research represents an important step forward in bridging the gap between 2D image synthesis and 3D texture generation, with potential applications in areas such as computer graphics, virtual environments, and product design. The adaptability of the framework to a wide range of publicly available models further enhances its practicality and accessibility.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

GenesisTex2: Stable, Consistent and High-Quality Text-to-Texture Generation

Jiawei Lu, Yingpeng Zhang, Zengjun Zhao, He Wang, Kun Zhou, Tianjia Shao

Large-scale text-guided image diffusion models have shown astonishing results in text-to-image (T2I) generation. However, applying these models to synthesize textures for 3D geometries remains challenging due to the domain gap between 2D images and textures on a 3D surface. Early works that used a projecting-and-inpainting approach managed to preserve generation diversity but often resulted in noticeable artifacts and style inconsistencies. While recent methods have attempted to address these inconsistencies, they often introduce other issues, such as blurring, over-saturation, or over-smoothing. To overcome these challenges, we propose a novel text-to-texture synthesis framework that leverages pretrained diffusion models. We first introduce a local attention reweighing mechanism in the self-attention layers to guide the model in concentrating on spatial-correlated patches across different views, thereby enhancing local details while preserving cross-view consistency. Additionally, we propose a novel latent space merge pipeline, which further ensures consistency across different viewpoints without sacrificing too much diversity. Our method significantly outperforms existing state-of-the-art techniques regarding texture consistency and visual quality, while delivering results much faster than distillation-based methods. Importantly, our framework does not require additional training or fine-tuning, making it highly adaptable to a wide range of models available on public platforms.

9/30/2024

TexPainter: Generative Mesh Texturing with Multi-view Consistency

Hongkun Zhang, Zherong Pan, Congyi Zhang, Lifeng Zhu, Xifeng Gao

The recent success of pre-trained diffusion models unlocks the possibility of the automatic generation of textures for arbitrary 3D meshes in the wild. However, these models are trained in the screen space, while converting them to a multi-view consistent texture image poses a major obstacle to the output quality. In this paper, we propose a novel method to enforce multi-view consistency. Our method is based on the observation that latent space in a pre-trained diffusion model is noised separately for each camera view, making it difficult to achieve multi-view consistency by directly manipulating the latent codes. Based on the celebrated Denoising Diffusion Implicit Models (DDIM) scheme, we propose to use an optimization-based color-fusion to enforce consistency and indirectly modify the latent codes by gradient back-propagation. Our method further relaxes the sequential dependency assumption among the camera views. By evaluating on a series of general 3D models, we find our simple approach improves consistency and overall quality of the generated textures as compared to competing state-of-the-arts. Our implementation is available at: https://github.com/Quantuman134/TexPainter

6/28/2024

Infinite Texture: Text-guided High Resolution Diffusion Texture Synthesis

Yifan Wang, Aleksander Holynski, Brian L. Curless, Steven M. Seitz

We present Infinite Texture, a method for generating arbitrarily large texture images from a text prompt. Our approach fine-tunes a diffusion model on a single texture, and learns to embed that statistical distribution in the output domain of the model. We seed this fine-tuning process with a sample texture patch, which can be optionally generated from a text-to-image model like DALL-E 2. At generation time, our fine-tuned diffusion model is used through a score aggregation strategy to generate output texture images of arbitrary resolution on a single GPU. We compare synthesized textures from our method to existing work in patch-based and deep learning texture synthesis methods. We also showcase two applications of our generated textures in 3D rendering and texture transfer.

5/15/2024

ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner

3D asset generation is getting massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior, and learn to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capabilities to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to the existing methods, the results generated by our method are consistent, and have favorable visual quality (-30% FID, -37% KID).

7/30/2024