Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars

Read original: arXiv:2311.16482 - Published 7/30/2024 by Yang Liu, Xiang Huang, Minghan Qin, Qinwei Lin, Haoqian Wang

Overview

Neural radiance fields can reconstruct high-quality human avatars, but are expensive to train and render, and not suitable for multi-human scenes with complex shadows.
To address these limitations, the authors propose Animatable 3D Gaussian, a method that learns human avatars from input images and poses.
The method extends 3D Gaussians to dynamic human scenes by modeling a set of skinned 3D Gaussians and a corresponding skeleton in canonical space, and deforming the 3D Gaussians to posed space based on input poses.
The authors introduce a multi-head hash encoder for pose-dependent shape and appearance, and a time-dependent ambient occlusion module to achieve high-quality reconstructions in scenes with complex motions and dynamic shadows.

Plain English Explanation

The paper proposes a new way to create high-quality, animatable 3D human avatars from input images and poses. Traditional methods using neural radiance fields can produce great results, but they are computationally expensive and not well-suited for scenes with multiple people and complex shadows.

The key idea is to model the human body as a set of 3D Gaussians that are attached to a skeleton and deformed according to the input poses. This allows for efficient rendering and makes it possible to handle scenes with multiple people. The authors also introduce a multi-head hash encoder to capture how shape and appearance depend on pose, and a time-dependent ambient occlusion module to model dynamic shadows.

Compared to InstantAvatar, this approach achieves higher reconstruction quality on both novel view synthesis and novel pose synthesis, with much less training time (1/60), less GPU memory (1/4), and faster rendering (7x). This makes it practical for real-world applications, like virtual avatars for games or video calls.

Technical Explanation

The core of the Animatable 3D Gaussian method is the representation of the human body using a set of skinned 3D Gaussians and a corresponding skeleton in a canonical, or reference, pose. When input poses are provided, the 3D Gaussians are deformed according to the skeleton transformations to match the desired pose.
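The deformation step described above can be sketched with linear blend skinning, a standard way to pose skinned primitives. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `deform_gaussians`, the array layouts, and the choice to blend 4x4 bone transforms per Gaussian are all hypothetical and chosen only for illustration.

```python
import numpy as np

def deform_gaussians(means, covs, skin_weights, bone_transforms):
    """Move canonical-space Gaussians to posed space (illustrative only).

    means:           (N, 3)    Gaussian centers in canonical space
    covs:            (N, 3, 3) Gaussian covariances in canonical space
    skin_weights:    (N, B)    per-Gaussian bone weights, rows sum to 1
    bone_transforms: (B, 4, 4) canonical-to-posed transform of each bone
    """
    # Blend the bone transforms into one 4x4 transform per Gaussian.
    blended = np.einsum("nb,bij->nij", skin_weights, bone_transforms)
    linear = blended[:, :3, :3]       # (N, 3, 3) linear part
    translation = blended[:, :3, 3]   # (N, 3)    translation part

    # Centers follow the affine map x -> A x + t.
    posed_means = np.einsum("nij,nj->ni", linear, means) + translation
    # Covariances follow Sigma -> A Sigma A^T under the same map.
    posed_covs = linear @ covs @ np.transpose(linear, (0, 2, 1))
    return posed_means, posed_covs
```

Because every Gaussian is transformed by a closed-form matrix product, posing the whole set is a handful of batched operations rather than a per-ray network query, which is consistent with the efficiency argument made in this summary.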

To capture the relationship between pose, shape, and appearance, the authors introduce a multi-head hash encoder. This allows the model to efficiently learn how the 3D Gaussian parameters and textures should change based on the input pose.
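As a rough illustration of the idea, the sketch below combines a hashed feature table with several output heads. It is a toy under stated assumptions and does not reproduce the paper's architecture: the class name `MultiHeadHashEncoder`, the single-resolution nearest-cell lookup, the linear heads, the way the pose vector is concatenated, and all sizes are hypothetical. The spatial hash (XOR of coordinates multiplied by large primes) follows the scheme popularized by Instant-NGP.

```python
import numpy as np

# Large primes for the spatial hash (values as used in Instant-NGP).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_coords(grid_coords, table_size):
    """Map non-negative integer grid coordinates (N, 3) to table indices (N,)."""
    c = grid_coords.astype(np.uint64)
    h = (c[:, 0] * PRIMES[0]) ^ (c[:, 1] * PRIMES[1]) ^ (c[:, 2] * PRIMES[2])
    return (h % np.uint64(table_size)).astype(np.int64)

class MultiHeadHashEncoder:
    """Toy encoder: one shared hash table, one linear head per output."""

    def __init__(self, head_dims, table_size=2**14, resolution=64,
                 feat_dim=8, pose_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        self.resolution = resolution
        # Feature table that would be learned during training.
        self.table = rng.normal(0.0, 1e-2, (table_size, feat_dim))
        # One weight matrix per head, e.g. "shape" and "appearance".
        self.heads = {name: rng.normal(0.0, 1e-2, (feat_dim + pose_dim, dim))
                      for name, dim in head_dims.items()}

    def __call__(self, positions, pose):
        """positions: (N, 3) in [0, 1); pose: (pose_dim,) pose feature."""
        cells = np.floor(np.clip(positions, 0.0, 1.0 - 1e-9) * self.resolution)
        feats = self.table[hash_coords(cells.astype(np.int64), len(self.table))]
        pose_tiled = np.broadcast_to(pose, (len(positions), pose.shape[0]))
        x = np.concatenate([feats, pose_tiled], axis=1)
        # Every head reads the same hashed features plus the pose.
        return {name: x @ w for name, w in self.heads.items()}
```

A call such as `MultiHeadHashEncoder({"shape": 3, "appearance": 4})(points, pose)` returns one array per head, so the same lookup yields both a pose-dependent shape offset and an appearance feature for each queried point.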

Additionally, the authors add a time-dependent ambient occlusion module to their model to account for dynamic shadows and lighting changes in the scene. This helps produce high-quality reconstructions even in complex scenes with multiple people in motion.
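One simple way to picture such a module is a scalar occlusion factor in (0, 1), predicted per Gaussian from its position and the frame time, that darkens the Gaussian's color. The sketch below is an assumption-laden illustration, not the paper's module: the function name `apply_ambient_occlusion`, the single linear layer with a sigmoid, and the parameter names `weights` and `bias` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apply_ambient_occlusion(colors, positions, t, weights, bias):
    """Darken colors by a time-dependent occlusion factor (illustrative only).

    colors:    (N, 3) base colors in [0, 1]
    positions: (N, 3) Gaussian centers
    t:         scalar frame time
    weights:   (4,)   hypothetical learned weights over (x, y, z, t)
    bias:      scalar hypothetical learned bias
    """
    # Append the frame time to every position so occlusion can vary over time.
    time_column = np.full((len(positions), 1), t)
    inputs = np.concatenate([positions, time_column], axis=1)
    # Occlusion factor in (0, 1): values near 1 leave the color unchanged.
    ao = sigmoid(inputs @ weights + bias)
    return colors * ao[:, None], ao
```

Because the factor depends on `t`, the same Gaussian can be shaded differently from frame to frame, which is how a shadow cast by a moving person could be absorbed without changing the underlying color.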

Compared to InstantAvatar, the Animatable 3D Gaussian approach achieves higher reconstruction quality on both novel view synthesis and novel pose synthesis, while requiring less training time (1/60), less GPU memory (1/4), and less rendering time (7x faster).

Critical Analysis

The authors acknowledge that their method is limited to capturing the overall shape and appearance of the human body, and may not be able to faithfully reproduce fine details like facial expressions or clothing textures. Additionally, although the method is reported to extend easily to multi-human scenes, including a scene with ten people trained in only 25 seconds, the novel view synthesis results there are described as comparable rather than superior, so handling heavy occlusions and close interactions between people may require further research.

While the performance improvements over previous methods are impressive, it's worth considering how the tradeoffs between quality, training cost, and rendering speed may impact real-world use cases. For example, applications that prioritize the highest possible fidelity may still prefer neural radiance field-based approaches, despite their higher computational requirements.

Overall, the Animatable 3D Gaussian method represents an interesting and practical advance in the field of animatable human avatars. The authors' focus on efficiency and the ability to handle complex scenes is a valuable contribution, and the techniques they introduce could potentially be applied to other 3D computer vision and graphics problems.

Conclusion

The Animatable 3D Gaussian paper presents a novel approach for reconstructing high-quality, animatable human avatars from input images and poses. By modeling the human body using a set of deformable 3D Gaussians, the method achieves significant improvements in training time, memory usage, and rendering speed compared to InstantAvatar, while also achieving higher reconstruction quality.

This work demonstrates the potential for efficient, real-time applications of animatable human avatars, such as in virtual communication, gaming, and entertainment. The authors' contributions to pose-dependent shape and appearance modeling, as well as their handling of dynamic lighting and shadows, are important advancements that could influence future research in this field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars

Yang Liu, Xiang Huang, Minghan Qin, Qinwei Lin, Haoqian Wang

Neural radiance fields are capable of reconstructing high-quality drivable human avatars but are expensive to train and render and not suitable for multi-human scenes with complex shadows. To reduce consumption, we propose Animatable 3D Gaussian, which learns human avatars from input images and poses. We extend 3D Gaussians to dynamic human scenes by modeling a set of skinned 3D Gaussians and a corresponding skeleton in canonical space and deforming 3D Gaussians to posed space according to the input poses. We introduce a multi-head hash encoder for pose-dependent shape and appearance and a time-dependent ambient occlusion module to achieve high-quality reconstructions in scenes containing complex motions and dynamic shadows. On both novel view synthesis and novel pose synthesis tasks, our method achieves higher reconstruction quality than InstantAvatar with less training time (1/60), less GPU memory (1/4), and faster rendering speed (7x). Our method can be easily extended to multi-human scenes and achieve comparable novel view synthesis results on a scene with ten people in only 25 seconds of training.

7/30/2024

Animatable and Relightable Gaussians for High-fidelity Human Avatar Modeling

Zhe Li, Yipengjing Sun, Zerong Zheng, Lizhen Wang, Shengping Zhang, Yebin Liu

Modeling animatable human avatars from RGB videos is a long-standing and challenging problem. Recent works usually adopt MLP-based neural radiance fields (NeRF) to represent 3D humans, but it remains difficult for pure MLPs to regress pose-dependent garment details. To this end, we introduce Animatable Gaussians, a new avatar representation that leverages powerful 2D CNNs and 3D Gaussian splatting to create high-fidelity avatars. To associate 3D Gaussians with the animatable avatar, we learn a parametric template from the input videos, and then parameterize the template on two front & back canonical Gaussian maps where each pixel represents a 3D Gaussian. The learned template is adaptive to the wearing garments for modeling looser clothes like dresses. Such template-guided 2D parameterization enables us to employ a powerful StyleGAN-based CNN to learn the pose-dependent Gaussian maps for modeling detailed dynamic appearances. Furthermore, we introduce a pose projection strategy for better generalization given novel poses. To tackle the realistic relighting of animatable avatars, we introduce physically-based rendering into the avatar representation for decomposing avatar materials and environment illumination. Overall, our method can create lifelike avatars with dynamic, realistic, generalized and relightable appearances. Experiments show that our method outperforms other state-of-the-art approaches.

5/28/2024

SG-GS: Photo-realistic Animatable Human Avatars with Semantically-Guided Gaussian Splatting

Haoyu Zhao, Chen Yang, Hao Wang, Xingyue Zhao, Wei Shen

Reconstructing photo-realistic animatable human avatars from monocular videos remains challenging in computer vision and graphics. Recently, methods using 3D Gaussians to represent the human body have emerged, offering faster optimization and real-time rendering. However, due to ignoring the crucial role of human body semantic information which represents the intrinsic structure and connections within the human body, they fail to achieve fine-detail reconstruction of dynamic human avatars. To address this issue, we propose SG-GS, which uses semantics-embedded 3D Gaussians, skeleton-driven rigid deformation, and non-rigid cloth dynamics deformation to create photo-realistic animatable human avatars from monocular videos. We then design a Semantic Human-Body Annotator (SHA) which utilizes SMPL's semantic prior for efficient body part semantic labeling. The generated labels are used to guide the optimization of Gaussian semantic attributes. To address the limited receptive field of point-level MLPs for local features, we also propose a 3D network that integrates geometric and semantic associations for human avatar deformation. We further implement three key strategies to enhance the semantic accuracy of 3D Gaussians and rendering quality: semantic projection with 2D regularization, semantic-guided density regularization and semantic-aware regularization with neighborhood consistency. Extensive experiments demonstrate that SG-GS achieves state-of-the-art geometry and appearance reconstruction performance.

8/20/2024

Interactive Rendering of Relightable and Animatable Gaussian Avatars

Youyi Zhan, Tianjia Shao, He Wang, Yin Yang, Kun Zhou

Creating relightable and animatable avatars from multi-view or monocular videos is a challenging task for digital human creation and virtual reality applications. Previous methods rely on neural radiance fields or ray tracing, resulting in slow training and rendering processes. By utilizing Gaussian Splatting, we propose a simple and efficient method to decouple body materials and lighting from sparse-view or monocular avatar videos, so that the avatar can be rendered simultaneously under novel viewpoints, poses, and lightings at interactive frame rates (6.9 fps). Specifically, we first obtain the canonical body mesh using a signed distance function and assign attributes to each mesh vertex. The Gaussians in the canonical space then interpolate from nearby body mesh vertices to obtain the attributes. We subsequently deform the Gaussians to the posed space using forward skinning, and combine the learnable environment light with the Gaussian attributes for shading computation. To achieve fast shadow modeling, we rasterize the posed body mesh from dense viewpoints to obtain the visibility. Our approach is not only simple but also fast enough to allow interactive rendering of avatar animation under environmental light changes. Experiments demonstrate that, compared to previous works, our method can render higher quality results at a faster speed on both synthetic and real datasets.

7/16/2024