Text-Driven Diverse Facial Texture Generation via Progressive Latent-Space Refinement

2404.09540

Published 4/16/2024 by Chi Wang, Junming Huang, Rong Zhang, Qi Wang, Haotian Yang, Haibin Huang, Chongyang Ma, Weiwei Xu

cs.CV

Text-Driven Diverse Facial Texture Generation via Progressive Latent-Space Refinement

Abstract

Automatic 3D facial texture generation has gained significant interest recently. Existing approaches may not support the traditional physically based rendering pipeline or rely on 3D data captured by Light Stage. Our key contribution is a progressive latent space refinement approach that can bootstrap from 3D Morphable Models (3DMMs)-based texture maps generated from facial images to generate high-quality and diverse PBR textures, including albedo, normal, and roughness. It starts with enhancing Generative Adversarial Networks (GANs) for text-guided and diverse texture generation. To this end, we design a self-supervised paradigm to overcome the reliance on ground truth 3D textures and train the generative model with only entangled texture maps. Besides, we foster mutual enhancement between GANs and Score Distillation Sampling (SDS). SDS boosts GANs with more generative modes, while GANs promote more efficient optimization of SDS. Furthermore, we introduce an edge-aware SDS for multi-view consistent facial structure. Experiments demonstrate that our method outperforms existing 3D texture generation methods regarding photo-realistic quality, diversity, and efficiency.

Create account to get full access

Overview

This paper introduces a novel method for generating diverse facial textures based on text descriptions.
The approach uses a progressive latent-space refinement technique to generate high-quality, diverse facial textures that match the input text.
The method outperforms existing text-to-image and text-to-3D generation models, demonstrating its effectiveness at capturing the nuanced details of facial features.

Plain English Explanation

This research paper presents a new way to generate realistic-looking facial textures based on written descriptions. The key idea is to use a progressive latent-space refinement technique to gradually refine the facial textures until they closely match the input text.

For example, if you provide a text description like "a kind, elderly man with warm, wrinkled skin," the system would generate a variety of facial textures that match this description, capturing details like the skin texture, facial features, and overall appearance. This allows for the creation of diverse and personalized facial renderings from text inputs.

The researchers show that their approach outperforms existing text-to-image and text-to-3D generation models, demonstrating its ability to generate high-quality, nuanced facial textures that closely match the input descriptions.

Technical Explanation

The core of this paper is a novel text-driven facial texture generation approach that uses a progressive latent-space refinement technique. The method starts with a low-resolution facial texture and then gradually refines it in the latent space to generate high-quality, diverse facial textures that align with the input text description.

The key technical components include:

A text encoder that maps the input text description into a latent representation.
A facial texture generator that takes the text latent and a noise vector as input, and outputs a low-resolution facial texture.
A progressive refinement module that iteratively enhances the facial texture resolution and detail to match the text description.

The researchers evaluate their approach on several datasets and show that it outperforms state-of-the-art text-to-3D generation models in terms of visual quality and diversity of the generated facial textures.

Critical Analysis

The paper presents a compelling and technically sophisticated approach for generating diverse facial textures from text descriptions. The progressive latent-space refinement technique is a novel contribution that effectively captures the nuanced details of facial features.

However, the paper does not address some potential limitations or areas for further research. For example, the method may struggle with generating highly expressive or exaggerated facial textures, as it is focused on realistic facial characteristics. Additionally, the computational complexity of the progressive refinement process could limit its applicability to real-time or interactive applications.

Further research could explore ways to incorporate more creative or stylized facial texture generation, or to optimize the efficiency of the refinement process. Additionally, integrating the facial texture generation with 3D face modeling could lead to more comprehensive and immersive text-driven 3D character creation.

Conclusion

This paper presents a novel approach for generating diverse, high-quality facial textures from text descriptions. The key innovation is the use of a progressive latent-space refinement technique, which allows the system to capture the nuanced details of facial features and generate a wide variety of personalized renderings.

The results demonstrate the effectiveness of this approach, outperforming existing text-to-image and text-to-3D generation models. While the paper does not address all potential limitations, it represents an important step forward in the field of text-driven 3D content creation, with applications in areas such as digital entertainment, virtual avatars, and human-computer interaction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

New!TexPainter: Generative Mesh Texturing with Multi-view Consistency

Hongkun Zhang, Zherong Pan, Congyi Zhang, Lifeng Zhu, Xifeng Gao

The recent success of pre-trained diffusion models unlocks the possibility of the automatic generation of textures for arbitrary 3D meshes in the wild. However, these models are trained in the screen space, while converting them to a multi-view consistent texture image poses a major obstacle to the output quality. In this paper, we propose a novel method to enforce multi-view consistency. Our method is based on the observation that latent space in a pre-trained diffusion model is noised separately for each camera view, making it difficult to achieve multi-view consistency by directly manipulating the latent codes. Based on the celebrated Denoising Diffusion Implicit Models (DDIM) scheme, we propose to use an optimization-based color-fusion to enforce consistency and indirectly modify the latent codes by gradient back-propagation. Our method further relaxes the sequential dependency assumption among the camera views. By evaluating on a series of general 3D models, we find our simple approach improves consistency and overall quality of the generated textures as compared to competing state-of-the-arts. Our implementation is available at: https://github.com/Quantuman134/TexPainter

6/28/2024

cs.CV cs.GR

DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors

Biwen Lei, Kai Yu, Mengyang Feng, Miaomiao Cui, Xuansong Xie

Text-guided domain adaptation and generation of 3D-aware portraits find many applications in various fields. However, due to the lack of training data and the challenges in handling the high variety of geometry and appearance, the existing methods for these tasks suffer from issues like inflexibility, instability, and low fidelity. In this paper, we propose a novel framework DiffusionGAN3D, which boosts text-guided 3D domain adaptation and generation by combining 3D GANs and diffusion priors. Specifically, we integrate the pre-trained 3D generative models (e.g., EG3D) and text-to-image diffusion models. The former provides a strong foundation for stable and high-quality avatar generation from text. And the diffusion models in turn offer powerful priors and guide the 3D generator finetuning with informative direction to achieve flexible and efficient text-guided domain adaptation. To enhance the diversity in domain adaptation and the generation capability in text-to-avatar, we introduce the relative distance loss and case-specific learnable triplane respectively. Besides, we design a progressive texture refinement module to improve the texture quality for both tasks above. Extensive experiments demonstrate that the proposed framework achieves excellent results in both domain adaptation and text-to-avatar tasks, outperforming existing methods in terms of generation quality and efficiency. The project homepage is at https://younglbw.github.io/DiffusionGAN3D-homepage/.

4/15/2024

cs.CV

📈

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto

In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts, and may often entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the necessity to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D from the same text prompt.

4/30/2024

cs.CV cs.AI

🛸

Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, Yao Yao

Generating high-quality 3D assets from text and images has long been challenging, primarily due to the absence of scalable 3D representations capable of capturing intricate geometry distributions. In this work, we introduce Direct3D, a native 3D generative model scalable to in-the-wild input images, without requiring a multiview diffusion model or SDS optimization. Our approach comprises two primary components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently encodes high-resolution 3D shapes into a compact and continuous latent triplane space. Notably, our method directly supervises the decoded geometry using a semi-continuous surface sampling strategy, diverging from previous methods relying on rendered images as supervision signals. D3D-DiT models the distribution of encoded 3D latents and is specifically designed to fuse positional information from the three feature maps of the triplane latent, enabling a native 3D generative model scalable to large-scale 3D datasets. Additionally, we introduce an innovative image-to-3D generation pipeline incorporating semantic and pixel-level image conditions, allowing the model to produce 3D shapes consistent with the provided conditional image input. Extensive experiments demonstrate the superiority of our large-scale pre-trained Direct3D over previous image-to-3D approaches, achieving significantly better generation quality and generalization ability, thus establishing a new state-of-the-art for 3D content creation. Project page: https://nju-3dv.github.io/projects/Direct3D/.

6/4/2024

cs.CV