ShapeFusion: A 3D diffusion model for localized shape editing

2403.19773

Published 4/5/2024 by Rolandos Alexandros Potamias, Michail Tarasiou, Stylianos Ploumpis, Stefanos Zafeiriou

Abstract

In the realm of 3D computer vision, parametric models have emerged as a ground-breaking methodology for the creation of realistic and expressive 3D avatars. Traditionally, they rely on Principal Component Analysis (PCA), given its ability to decompose data to an orthonormal space that maximally captures shape variations. However, due to the orthogonality constraints and the global nature of PCA's decomposition, these models struggle to perform localized and disentangled editing of 3D shapes, which severely affects their use in applications requiring fine control such as face sculpting. In this paper, we leverage diffusion models to enable diverse and fully localized edits on 3D meshes, while completely preserving the unedited regions. We propose an effective diffusion masking training strategy that, by design, facilitates localized manipulation of any shape region, without being limited to predefined regions or to sparse sets of predefined control vertices. Following our framework, a user can explicitly set their manipulation region of choice and define an arbitrary set of vertices as handles to edit a 3D mesh. Compared to the current state-of-the-art, our method leads to more interpretable shape manipulations than methods relying on latent code state, greater localization and generation diversity, while offering faster inference than optimization-based approaches. Project page: https://rolpotamias.github.io/Shapefusion/

Introduction

This paper discusses the importance of 3D human bodies and faces in modern digital avatar applications, such as gaming, graphics, and virtual reality. Over the past decade, several methods have been developed to model 3D humans, with parametric models being among the best performing ones. Parametric and 3D morphable models (3DMMs) project 3D shapes into compact, low-dimensional latent representations, usually via PCA, to efficiently capture the essential characteristics and variations of the human shape.
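
To make the PCA-based parametric model concrete, the following is a minimal NumPy sketch, not the paper's code. The dataset is random toy data and all sizes and variable names are hypothetical. It projects a flattened mesh onto a few orthonormal components, reconstructs it, and then changes a single coefficient to show that the change spreads over the whole mesh.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": 50 shapes with 100 vertices each (hypothetical sizes, random data).
num_shapes, num_vertices, num_components = 50, 100, 8
shapes = rng.normal(size=(num_shapes, num_vertices, 3))

# Flatten each mesh to a vector and center the data.
X = shapes.reshape(num_shapes, -1)
mean_shape = X.mean(axis=0)
Xc = X - mean_shape

# PCA via SVD: the rows of Vt are orthonormal principal directions.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
basis = Vt[:num_components]                 # (k, 3 * num_vertices)

# Project one shape to its latent code and reconstruct it.
code = (X[0] - mean_shape) @ basis.T        # (k,)
recon = mean_shape + code @ basis

# Perturb a single coefficient and record which vertices move.
edited_code = code.copy()
edited_code[0] += 1.0
edited = mean_shape + edited_code @ basis
moved = np.abs(edited - recon).reshape(num_vertices, 3).sum(axis=1) > 1e-9
print(f"{moved.mean():.0%} of vertices moved after editing one coefficient")
```

Because each principal direction has support on every vertex, a one-coefficient edit is global, which is the entanglement that motivates a localized alternative.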

Recent methods have shown that non-linear models, such as Graph Neural Networks and implicit functions, can further improve the modeling of 3D shapes. However, all these methods share a common limitation: an entangled latent space that hinders localized editing. Because of this entanglement, the latent codes of parametric models are not human-interpretable, making it difficult to identify codes that control region-specific features.

Figure 1: Illustration of the properties of the proposed method for localized editing (Top) and region sampling (Bottom). Top: The proposed method can manipulate any region of a mesh by simply setting a user-defined anchor point and its surrounding region. The manipulations are completely disentangled and affect only the selected region. The disentanglement of the manipulations is illustrated using the color-coded distances from the previous manipulation step. Bottom: The proposed method can also sample new face parts and expressions by simply defining a mask over the desired region.

The paper introduces a new technique for localized 3D human modeling using diffusion models. The proposed method extends prior work on point cloud diffusion models to 3D meshes using a geometry-aware embedding layer. The task of localized shape modeling is formulated as an inpainting problem using a masking training approach, where the diffusion process acts locally on the masked regions.

The masking strategy enables learning of local topological features that facilitate manipulation directly on the vertex space and guarantees the disentanglement of the masked from the unmasked regions. The approach also enables conditioned local editing and sculpting by selecting arbitrary anchor points to drive the generation process.

The contributions of the study include:

A training strategy for diffusion models that learns local priors of the underlying data distribution, highlighting the superiority of diffusion models compared to traditional VAE architectures for localized shape manipulations.
A localized 3D model called ShapeFusion that enables direct point manipulation, sculpting, and expression editing directly in the 3D space, providing an interpretable paradigm compared to current methods that rely on the state of the latent code for mesh manipulation.
ShapeFusion generates diverse region samples that outperform current state-of-the-art models and learns strong priors that can substitute current parametric models.

Related Work

The paper discusses disentangled and localized models for 3D shape generation. Disentangled generative models aim to encode underlying factors of variation in data into separate subsets of features, allowing for interpretable and independent control over each factor. Various approaches have been proposed to achieve disentanglement, such as separating shape and pose, learning local facial expression manipulations, and factorizing the latent space. However, achieving spatially disentangled shape editing remains challenging.

The authors propose a method that tackles localized manipulation directly in the 3D space, which is fully interpretable and guarantees spatially localized edits by design, in contrast to prior works that attempt to learn disentangled representations by factorizing the latent space.

The paper also discusses parametric models, which are generative models that enable the generation of new shapes by modifying their compact latent representations. Principal component analysis (PCA) has been widely used in parametric models for faces, bodies, and hands. However, PCA models require a large number of parameters to accurately model diverse datasets.

Recent advancements in diffusion models have revolutionized the field of image generation. The paper mentions several diffusion models applied to 3D shapes, such as learning conditional distributions of point clouds, embedding input point clouds in separate latent spaces, and using deformable tetrahedral grid parametrization. The authors extend a previous diffusion model from point clouds to triangular meshes with fixed topology and enforce localized attribute learning using an inpainting technique during training.

Method

The proposed method addresses the limitations of previous works in achieving fully localized 3D shape manipulation. It introduces a training scheme based on a masked diffusion process and constructs a fully localized model capable of guaranteeing local manipulations in the 3D space. The framework consists of two main components: the Forward Diffusion process, which gradually adds noise to the input mesh, and the Denoising Module, which predicts the denoised version of the input. An overview of the proposed method is illustrated in Figure 2.

Figure 2: Method overview: We propose a 3D diffusion model for localized attribute manipulation and editing. During forward diffusion step, noise is gradually added to random regions of the mesh, indicated by a mask M. In the denoising step, a hierarchical network based on mesh convolution is used to learn a prior distribution of each attribute directly on the vertex space.

The paper introduces a masked forward diffusion process for localized editing of 3D meshes. Noise is gradually added to specific areas of the mesh defined by a mask, while the remaining vertices remain unaffected. This approach guarantees fully localized editing and direct manipulation of any point and region of the mesh.
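
The masked forward process can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: it uses the standard DDPM closed form with a linear beta schedule (the summary does not state the schedule), and the function name `masked_forward_diffusion` and the toy mesh are hypothetical.

```python
import numpy as np

def masked_forward_diffusion(x0, mask, t, alphas_cumprod, rng):
    """Noise only the masked vertices of a mesh at timestep t.

    x0:   (N, 3) clean vertex positions.
    mask: (N,) boolean, True where the region may be edited.
    Returns the partially noised mesh and the noise that was drawn.
    """
    eps = rng.normal(size=x0.shape)
    a_bar = alphas_cumprod[t]
    noised = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    m = mask[:, None].astype(x0.dtype)
    # Masked vertices take the noised value; all others keep their clean position.
    x_t = m * noised + (1.0 - m) * x0
    return x_t, eps

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)

x0 = rng.normal(size=(200, 3))              # toy mesh with 200 vertices
mask = np.zeros(200, dtype=bool)
mask[50:80] = True                          # hypothetical editable region

x_t, eps = masked_forward_diffusion(x0, mask, 500, alphas_cumprod, rng)
```

Vertices outside the mask are bit-for-bit identical before and after the step, which is the property that lets the summary describe the edits as localized by design.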

A denoising module is trained to predict the noise added to the input using a hierarchical mesh convolution layer. This layer allows information to propagate between distant regions of the shape and forces the manipulated regions to respect the unmasked geometry. The network uses a vertex-index positional encoding to break permutation equivariance and learn a vertex-specific prior.
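
A rough sketch of two of the ingredients named above, written under assumptions the summary does not confirm: the positional encoding is taken to be sinusoidal in the vertex index, and the noise-prediction loss is taken to be a mean squared error restricted to masked vertices. Both function names are hypothetical.

```python
import numpy as np

def vertex_index_encoding(num_vertices, dim):
    """Sinusoidal encoding of the vertex index, shape (num_vertices, dim); dim must be even."""
    idx = np.arange(num_vertices)[:, None]                           # (N, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim / 2,)
    angles = idx * freqs[None, :]
    enc = np.zeros((num_vertices, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def masked_noise_loss(eps_pred, eps_true, mask):
    """Mean squared error between predicted and true noise over masked vertices only."""
    diff = (eps_pred - eps_true)[mask]
    return float(np.mean(diff ** 2))

# Each vertex gets a distinct code, so the network can tell vertices apart
# even when their coordinates are similar.
pe = vertex_index_encoding(100, 16)
coords = np.zeros((100, 3))
network_input = np.concatenate([coords, pe], axis=1)                 # (100, 19)
```

Concatenating the encoding to the coordinates is one simple way to feed it to a network; adding it to an embedded feature vector would be an equally plausible choice.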

The hierarchical mesh convolution layer operates on different mesh resolutions, with features calculated recursively from coarser to finer levels. Spiral mesh convolutions are used to define the neighborhood of each vertex uniquely.
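
A single spiral convolution can be sketched as a gather followed by a linear map. The sketch below assumes the spiral index table is precomputed from the mesh topology; here a toy ring of vertices stands in for it. It covers only one resolution level, not the coarse-to-fine hierarchy, and the function name `spiral_conv` is hypothetical.

```python
import numpy as np

def spiral_conv(features, spiral_indices, weight, bias):
    """Apply one spiral convolution.

    features:       (N, C_in) per-vertex features.
    spiral_indices: (N, L) ordered neighbor indices; row i starts at vertex i.
    weight:         (L * C_in, C_out); bias: (C_out,).
    """
    n, length = spiral_indices.shape
    gathered = features[spiral_indices]                  # (N, L, C_in)
    flat = gathered.reshape(n, length * features.shape[1])
    return flat @ weight + bias

rng = np.random.default_rng(0)
n, c_in, c_out, length = 12, 4, 6, 5

# Toy stand-in for precomputed spirals: on a ring, take each vertex and its
# next four neighbors in a fixed order. Real spirals follow the mesh connectivity.
spiral_indices = (np.arange(n)[:, None] + np.arange(length)[None, :]) % n

features = rng.normal(size=(n, c_in))
weight = rng.normal(size=(length * c_in, c_out))
bias = np.zeros(c_out)
out = spiral_conv(features, spiral_indices, weight, bias)
```

Because the neighbor order in each row is fixed, the same weight slice always multiplies the same relative neighbor, which is what makes the neighborhood definition unique per vertex.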

Experiments

The paper utilizes three datasets (UHM, STAR, and MimicMe) to train and evaluate the proposed method for disentangled manipulation of 3D face and body meshes. The model is compared against baselines including SD, LED, and a VAE method (M-VAE).

Quantitative evaluation on localized region sampling shows the proposed method outperforms baselines in terms of diversity, identity preservation, and FID score. Qualitative results demonstrate the proposed method achieves localized manipulation without affecting other regions.

A key property is the ability to locally edit regions conditioned on a single anchor point, without requiring optimization like previous methods. This enables direct point manipulation approximately 10 times faster than baselines.
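
One way to picture anchor-conditioned editing is an inpainting-style reverse process in which only masked vertices are resampled, while the anchor and all unmasked vertices are held at their target positions. The sketch below is a hedged illustration using standard DDPM ancestral sampling; the paper's exact sampler may differ, `sample_region` is a hypothetical name, and `dummy_denoiser` is a placeholder for a trained network.

```python
import numpy as np

def sample_region(x_known, mask, denoiser, betas, rng):
    """DDPM-style reverse process restricted to masked vertices.

    x_known:  (N, 3) mesh with unmasked vertices and anchors at their target positions.
    mask:     (N,) boolean, True for vertices to regenerate.
    denoiser: callable (x_t, t) -> predicted noise of shape (N, 3).
    """
    alphas = 1.0 - betas
    alphas_cumprod = np.cumprod(alphas)
    m = mask[:, None].astype(x_known.dtype)

    # Start from pure noise inside the mask and known geometry elsewhere.
    x = m * rng.normal(size=x_known.shape) + (1.0 - m) * x_known
    for t in range(len(betas) - 1, -1, -1):
        eps_pred = denoiser(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * eps_pred) / np.sqrt(alphas[t])
        noise = rng.normal(size=x.shape) if t > 0 else 0.0
        x_prev = mean + np.sqrt(betas[t]) * noise
        # Only masked vertices are updated; everything else stays fixed.
        x = m * x_prev + (1.0 - m) * x_known
    return x

def dummy_denoiser(x_t, t):
    """Placeholder for a trained noise-prediction network."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)             # assumed short linear schedule
template = rng.normal(size=(100, 3))            # toy mesh

mask = np.zeros(100, dtype=bool)
mask[20:40] = True                              # region the user selected
anchor = 30
mask[anchor] = False                            # the anchor is treated as known

target = template.copy()
target[anchor] += np.array([0.0, 0.1, 0.0])     # user drags the anchor

edited = sample_region(target, mask, dummy_denoiser, betas, rng)
```

With a trained denoiser in place of the placeholder, a single pass of this loop would produce the edit, with no per-edit optimization.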

The model can also be used as a generative prior for unconditioned face/body generation by masking the entire shape region. As an autodecoder, it reconstructs sparse inputs well, achieving lower error than PCA and SD methods with just 200 anchor points.

Region swapping between identities is demonstrated, with applications to aesthetic medicine. For localized expression editing, the method is compared to NFR, the state-of-the-art. It achieves fully-localized edits and generalizes to out-of-distribution expressions, while being 20 times faster than NFR's optimization.

Conclusion

The paper presents a 3D diffusion model for localized shape manipulation. The method uses an inpainting-inspired training technique to ensure that edits stay within the selected regions. It outperforms current state-of-the-art disentangled manipulation methods and addresses their limitations. Experiments demonstrate the method's ability to manipulate facial and body parts as well as expressions using single or multiple anchor points. The method can serve as an interactive 3D editing tool for digital artists and has applications in aesthetic medicine. The research was supported by EPSRC Projects DEFORM (EP/S010203/1) and GNOMON (EP/X011364).

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

Xiyi Chen, Marko Mihajlovic, Shaofei Wang, Sergey Prokudin, Siyu Tang

Recent advances in generative diffusion models have enabled the previously unfeasible capability of generating 3D assets from a single input image or a text prompt. In this work, we aim to enhance the quality and functionality of these models for the task of creating controllable, photorealistic human avatars. We achieve this by integrating a 3D morphable model into the state-of-the-art multi-view-consistent diffusion approach. We demonstrate that accurate conditioning of a generative pipeline on the articulated 3D model enhances the baseline model performance on the task of novel view synthesis from a single image. More importantly, this integration facilitates a seamless and accurate incorporation of facial expression and body pose control into the generation process. To the best of our knowledge, our proposed framework is the first diffusion model to enable the creation of fully 3D-consistent, animatable, and photorealistic human avatars from a single image of an unseen subject; extensive quantitative and qualitative evaluations demonstrate the advantages of our approach over existing state-of-the-art avatar creation models on both novel view and novel expression synthesis tasks. The code for our project is publicly available.

4/3/2024

cs.CV cs.AI

Generating Images with 3D Annotations Using Diffusion Models

Wufei Ma, Qihao Liu, Jiahao Wang, Angtian Wang, Xiaoding Yuan, Yi Zhang, Zihao Xiao, Guofeng Zhang, Beijia Lu, Ruxiao Duan, Yongrui Qi, Adam Kortylewski, Yaoyao Liu, Alan Yuille

Diffusion models have emerged as a powerful generative method, capable of producing stunning photo-realistic images from natural language descriptions. However, these models lack explicit control over the 3D structure in the generated images. Consequently, this hinders our ability to obtain detailed 3D annotations for the generated images or to craft instances with specific poses and distances. In this paper, we propose 3D Diffusion Style Transfer (3D-DST), which incorporates 3D geometry control into diffusion models. Our method exploits ControlNet, which extends diffusion models by using visual prompts in addition to text prompts. We generate images of the 3D objects taken from 3D shape repositories (e.g., ShapeNet and Objaverse), render them from a variety of poses and viewing directions, compute the edge maps of the rendered images, and use these edge maps as visual prompts to generate realistic images. With explicit 3D geometry control, we can easily change the 3D structures of the objects in the generated images and obtain ground-truth 3D annotations automatically. This allows us to improve a wide range of vision tasks, e.g., classification and 3D pose estimation, in both in-distribution (ID) and out-of-distribution (OOD) settings. We demonstrate the effectiveness of our method through extensive experiments on ImageNet-100/200, ImageNet-R, PASCAL3D+, ObjectNet3D, and OOD-CV. The results show that our method significantly outperforms existing methods, e.g., 3.8 percentage points on ImageNet-100 using DeiT-B.

4/5/2024

cs.CV

Part-aware Shape Generation with Latent 3D Diffusion of Neural Voxel Fields

Yuhang Huang, Shilong Zou, Xinwang Liu, Kai Xu

This paper presents a novel latent 3D diffusion model for the generation of neural voxel fields, aiming to achieve accurate part-aware structures. Compared to existing methods, there are two key designs to ensure high-quality and accurate part-aware generation. On one hand, we introduce a latent 3D diffusion process for neural voxel fields, enabling generation at significantly higher resolutions that can accurately capture rich textural and geometric details. On the other hand, a part-aware shape decoder is introduced to integrate the part codes into the neural voxel fields, guiding the accurate part decomposition and producing high-quality rendering results. Through extensive experimentation and comparisons with state-of-the-art methods, we evaluate our approach across four different classes of data. The results demonstrate the superior generative capabilities of our proposed method in part-aware shape generation, outperforming existing state-of-the-art methods.

5/10/2024

cs.CV

4D Facial Expression Diffusion Model

Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo

Facial expression generation is one of the most challenging and long-sought aspects of character animation, with many interesting applications. The challenging task, traditionally having relied heavily on digital craftspersons, remains yet to be explored. In this paper, we introduce a generative framework for generating 3D facial expression sequences (i.e. 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) Learning the generative model that is trained over a set of 3D landmark sequences, and (2) Generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks of other domains. While it can be trained unconditionally, its reverse process can still be conditioned by various condition signals. This allows us to efficiently develop several downstream tasks involving various conditional generation, by using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder to apply the geometrical deformation embedded in landmarks on a given facial mesh. Experiments show that our model has learned to generate realistic, quality expressions solely from the dataset of relatively small size, improving over the state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM.

4/16/2024

cs.CV