MaTe3D: Mask-guided Text-based 3D-aware Portrait Editing

2312.06947

Published 5/6/2024 by Kangneng Zhou, Daiheng Gao, Xuan Wang, Jie Zhang, Peng Zhang, Xusen Sun, Longhao Zhang, Shiqi Yang, Bang Zhang, Liefeng Bo and 2 others

cs.CV

MaTe3D: Mask-guided Text-based 3D-aware Portrait Editing

Abstract

3D-aware portrait editing has a wide range of applications in multiple fields. However, current approaches are limited due that they can only perform mask-guided or text-based editing. Even by fusing the two procedures into a model, the editing quality and stability cannot be ensured. To address this limitation, we propose textbf{MaTe3D}: mask-guided text-based 3D-aware portrait editing. In this framework, first, we introduce a new SDF-based 3D generator which learns local and global representations with proposed SDF and density consistency losses. This enhances masked-based editing in local areas; second, we present a novel distillation strategy: Conditional Distillation on Geometry and Texture (CDGT). Compared to exiting distillation strategies, it mitigates visual ambiguity and avoids mismatch between texture and geometry, thereby producing stable texture and convincing geometry while editing. Additionally, we create the CatMask-HQ dataset, a large-scale high-resolution cat face annotation for exploration of model generalization and expansion. We perform expensive experiments on both the FFHQ and CatMask-HQ datasets to demonstrate the editing quality and stability of the proposed method. Our method faithfully generates a 3D-aware edited face image based on a modified mask and a text prompt. Our code and models will be publicly released.

Create account to get full access

Overview

The paper "MaTe3D: Mask-guided Text-based 3D-aware Portrait Editing" presents a novel approach to 3D-aware portrait editing using text-based guidance and mask-based image manipulation.
The proposed MaTe3D model allows users to edit 3D portrait images by simply providing text descriptions, leveraging a mask-guided generation process to produce realistic and coherent results.
This work builds upon and extends previous research on text-guided 3D generation and text-driven diverse facial texture generation.

Plain English Explanation

The paper introduces a new way to edit 3D portraits using text descriptions. Typically, editing 3D portraits requires specialized skills and software, but the MaTe3D model makes it much simpler.

With MaTe3D, users can describe the changes they want to make to a 3D portrait, such as changing the person's hairstyle or adding facial features, and the model will automatically generate the edited 3D portrait. The key innovation is the use of a "mask" - a way to precisely specify which parts of the portrait should be edited based on the text description.

This allows for more realistic and coherent edits compared to previous text-based 3D generation approaches, which had difficulty maintaining the overall structure and realism of the portrait. By guiding the editing process with masks, the MaTe3D model can produce high-quality, natural-looking 3D portrait edits from just text input.

Technical Explanation

The MaTe3D model consists of a text-to-3D generator that takes in a text description and a source 3D portrait, and outputs an edited 3D portrait. This generator is guided by a mask-based image manipulation module that uses the text description to identify which regions of the 3D portrait should be modified.

The key technical innovations include:

A multi-modal encoder that jointly encodes the text description and the source 3D portrait
A mask-guided 3D portrait editing module that applies the text-based edits in a spatially-aware manner
A training process that leverages both 3D portrait data and 2D mask-based editing pairs to improve the model's ability to generate realistic and coherent 3D edits

Through extensive experiments, the authors demonstrate that MaTe3D outperforms previous text-to-3D generation methods in terms of both edit quality and alignment with the text descriptions.

Critical Analysis

The MaTe3D paper introduces a promising approach to 3D portrait editing that addresses several limitations of prior work. By incorporating mask-based guidance, the model is able to make more precise and natural-looking edits compared to purely text-driven 3D generation.

However, the paper does not explore the limits of the model's capabilities, such as the types of edits it can handle, the level of detail it can achieve, or the robustness to diverse portrait inputs. Additionally, the training dataset and evaluation metrics could be expanded to better assess the model's performance in real-world scenarios.

Further research could also investigate ways to make the editing process more interactive and intuitive for users, perhaps by allowing iterative refinement of the text descriptions or the mask annotations. Integrating the MaTe3D model into a end-to-end 3D portrait editing system would also be an interesting direction to explore.

Overall, the MaTe3D paper represents an important step forward in making 3D portrait editing more accessible and versatile through the use of text-based and mask-guided generation techniques.

Conclusion

The "MaTe3D: Mask-guided Text-based 3D-aware Portrait Editing" paper presents a novel approach to 3D portrait editing that allows users to make changes to 3D portraits simply by providing text descriptions. By leveraging a mask-guided generation process, the MaTe3D model is able to produce realistic and coherent edits that align well with the user's intent.

This work builds upon and extends previous research on text-guided 3D generation and text-driven facial texture editing, demonstrating the potential of combining these techniques to enable more accessible and powerful 3D portrait manipulation tools. While there are still opportunities for further improvement and exploration, the MaTe3D model represents an important step forward in making 3D editing more intuitive and accessible to a wider range of users.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Real-time 3D-aware Portrait Editing from a Single Image

Qingyan Bai, Zifan Shi, Yinghao Xu, Hao Ouyang, Qiuyu Wang, Ceyuan Yang, Xuan Wang, Gordon Wetzstein, Yujun Shen, Qifeng Chen

This work presents 3DPE, a practical method that can efficiently edit a face image following given prompts, like reference images or text descriptions, in a 3D-aware manner. To this end, a lightweight module is distilled from a 3D portrait generator and a text-to-image model, which provide prior knowledge of face geometry and superior editing capability, respectively. Such a design brings two compelling advantages over existing approaches. First, our system achieves real-time editing with a feedforward network (i.e., ~0.04s per image), over 100x faster than the second competitor. Second, thanks to the powerful priors, our module could focus on the learning of editing-related variations, such that it manages to handle various types of editing simultaneously in the training phase and further supports fast adaptation to user-specified customized types of editing during inference (e.g., with ~5min fine-tuning per style). The code, the model, and the interface will be made publicly available to facilitate future research.

4/3/2024

cs.CV

Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization

Jinlu Zhang, Yiyi Zhou, Qiancheng Zheng, Xiaoxiong Du, Gen Luo, Jun Peng, Xiaoshuai Sun, Rongrong Ji

Text-to-3D-aware face (T3D Face) generation and manipulation is an emerging research hot spot in machine learning, which still suffers from low efficiency and poor quality. In this paper, we propose an End-to-End Efficient and Effective network for fast and accurate T3D face generation and manipulation, termed $E^3$-FaceNet. Different from existing complex generation paradigms, $E^3$-FaceNet resorts to a direct mapping from text instructions to 3D-aware visual space. We introduce a novel Style Code Enhancer to enhance cross-modal semantic alignment, alongside an innovative Geometric Regularization objective to maintain consistency across multi-view generations. Extensive experiments on three benchmark datasets demonstrate that $E^3$-FaceNet can not only achieve picture-like 3D face generation and manipulation, but also improve inference speed by orders of magnitudes. For instance, compared with Latent3D, $E^3$-FaceNet speeds up the five-view generations by almost 470 times, while still exceeding in generation quality. Our code is released at https://github.com/Aria-Zhangjl/E3-FaceNet.

5/27/2024

cs.CV

Portrait3D: 3D Head Generation from Single In-the-wild Portrait Image

Jinkun Hao, Junshu Tang, Jiangning Zhang, Ran Yi, Yijia Hong, Moran Li, Weijian Cao, Yating Wang, Lizhuang Ma

While recent works have achieved great success on one-shot 3D common object generation, high quality and fidelity 3D head generation from a single image remains a great challenge. Previous text-based methods for generating 3D heads were limited by text descriptions and image-based methods struggled to produce high-quality head geometry. To handle this challenging problem, we propose a novel framework, Portrait3D, to generate high-quality 3D heads while preserving their identities. Our work incorporates the identity information of the portrait image into three parts: 1) geometry initialization, 2) geometry sculpting, and 3) texture generation stages. Given a reference portrait image, we first align the identity features with text features to realize ID-aware guidance enhancement, which contains the control signals representing the face information. We then use the canny map, ID features of the portrait image, and a pre-trained text-to-normal/depth diffusion model to generate ID-aware geometry supervision, and 3D-GAN inversion is employed to generate ID-aware geometry initialization. Furthermore, with the ability to inject identity information into 3D head generation, we use ID-aware guidance to calculate ID-aware Score Distillation (ISD) for geometry sculpting. For texture generation, we adopt the ID Consistent Texture Inpainting and Refinement which progressively expands the view for texture inpainting to obtain an initialization UV texture map. We then use the id-aware guidance to provide image-level supervision for noisy multi-view images to obtain a refined texture map. Extensive experiments demonstrate that we can generate high-quality 3D heads with accurate geometry and texture from single in-the-wild portrait images. The project page is at https://jinkun-hao.github.io/Portrait3D/.

6/26/2024

cs.CV

Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior

Yiqian Wu, Hao Xu, Xiangjun Tang, Xien Chen, Siyu Tang, Zhebin Zhang, Chen Li, Xiaogang Jin

Existing neural rendering-based text-to-3D-portrait generation methods typically make use of human geometry prior and diffusion models to obtain guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a novel neural rendering-based framework with a novel joint geometry-appearance prior to achieve text-to-3D-portrait generation that overcomes the aforementioned issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator is capable of producing 360{deg} canonical 3D portraits, serving as a starting point for the subsequent diffusion-based generation process. To mitigate the grid-like artifact caused by the high-frequency information in the feature-map-based 3D representation commonly used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the pre-trained 3DPortraitGAN-Pyramid's latent space. The resulting latent code is then used to synthesize a pyramid tri-grid. Beginning with the obtained pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into the pyramid tri-grid. Following that, we utilize the diffusion model to refine the rendered images of the 3D portrait and then use these refined images as training data to further optimize the pyramid tri-grid, effectively eliminating issues with unrealistic color and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt.

4/17/2024

cs.CV