MOSS: Motion-based 3D Clothed Human Synthesis from Monocular Video

2405.12806

Published 6/26/2024 by Hongsheng Wang, Xiang Cai, Xi Sun, Jinhong Yue, Zhanyun Tang, Shengyu Zhang, Feng Lin, Fei Wu

Abstract

Single-view clothed human reconstruction holds a central position in virtual reality applications, especially in contexts involving intricate human motions. It presents notable challenges in achieving realistic clothing deformation. Current methodologies often overlook the influence of motion on surface deformation, resulting in surfaces lacking the constraints imposed by global motion. To overcome these limitations, we introduce an innovative framework, Motion-Based 3D Clothed Humans Synthesis (MOSS), which employs kinematic information to achieve motion-aware Gaussian split on the human surface. Our framework consists of two modules: Kinematic Gaussian Locating Splatting (KGAS) and Surface Deformation Detector (UID). KGAS incorporates matrix-Fisher distribution to propagate global motion across the body surface. The density and rotation factors of this distribution explicitly control the Gaussians, thereby enhancing the realism of the reconstructed surface. Additionally, to address local occlusions in single-view, based on KGAS, UID identifies significant surfaces, and geometric reconstruction is performed to compensate for these deformations. Experimental results demonstrate that MOSS achieves state-of-the-art visual quality in 3D clothed human synthesis from monocular videos. Notably, we improve the Human NeRF and the Gaussian Splatting by 33.94% and 16.75% in LPIPS* respectively. Codes are available at https://wanghongsheng01.github.io/MOSS/.

Overview

  • This paper introduces an innovative framework called Motion-Based 3D Clothed Humans Synthesis (MOSS) that addresses the challenge of achieving realistic clothing deformation in single-view clothed human reconstruction for virtual reality applications.
  • The framework consists of two key modules: Kinematic Gaussian Locating Splatting (KGAS) and Surface Deformation Detector (UID).
  • KGAS incorporates matrix-Fisher distribution to propagate global motion across the body surface, enhancing the realism of the reconstructed surface.
  • UID identifies significant surfaces based on KGAS and performs geometric reconstruction to compensate for local occlusions in single-view scenarios.

Plain English Explanation

The paper focuses on a challenging problem in virtual reality (VR) applications: how to create realistic-looking 3D models of people wearing clothes. Current methods often struggle to capture the way clothes move and deform as the person moves, resulting in unrealistic-looking surfaces.

To address this, the researchers developed a new framework called MOSS. The key idea is to use information about the person's movement, or "kinematics," to better model how the clothes should deform.

MOSS has two main components. The first is KGAS, which uses a statistical distribution called the matrix-Fisher distribution to spread the effect of the person's movement across the entire surface of the clothes. This helps create more natural-looking deformations.

The second component is UID, which identifies areas on the clothes that are likely to be occluded (hidden) from the single camera view. It then performs additional geometric reconstruction in those areas to compensate for the missing information.

Together, these techniques allow MOSS to create 3D clothed human models from monocular (single-camera) videos that look more realistic, especially when the person is moving, compared to previous methods. The researchers show significant improvements in visual quality metrics compared to other state-of-the-art approaches.

Technical Explanation

The paper introduces the Motion-Based 3D Clothed Humans Synthesis (MOSS) framework to address the challenges in achieving realistic clothing deformation in single-view clothed human reconstruction.

The core of MOSS is the Kinematic Gaussian Locating Splatting (KGAS) module, which incorporates the matrix-Fisher distribution to propagate global motion information across the human body surface. This distribution's density and rotation factors explicitly control the Gaussians, enhancing the realism of the reconstructed surface.
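For readers unfamiliar with the matrix-Fisher distribution, the sketch below shows its standard textbook form on 3D rotations: the density is proportional to exp(tr(Fᵀ R)) for a 3x3 parameter matrix F, the most likely rotation (the mode) comes from a sign-corrected SVD of F, and the singular values act as concentration terms. This is a minimal illustration of the distribution itself, not the authors' KGAS implementation; the function names and the way F is constructed here are assumptions made for the example.

```python
import numpy as np


def matrix_fisher_mode(F):
    """Mode rotation and proper singular values of a matrix-Fisher parameter F.

    Hypothetical helper for illustration; not taken from the MOSS code base.
    """
    U, s, Vt = np.linalg.svd(F)
    # det(U) and det(Vt) are each +1 or -1; flip the last axis when their
    # product is -1 so that the mode is a proper rotation (determinant +1).
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    mode = U @ np.diag([1.0, 1.0, d]) @ Vt
    proper_singular_values = s * np.array([1.0, 1.0, d])
    return mode, proper_singular_values


def unnormalized_log_density(R, F):
    """log p(R; F) up to the normalizing constant: tr(F^T R)."""
    return float(np.trace(F.T @ R))


def rotation_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Example: F = concentration * R0 is a distribution peaked at R0.
R0 = rotation_z(0.3)
F = 5.0 * R0
mode, concentration = matrix_fisher_mode(F)
```

Larger singular values mean the distribution is more tightly concentrated around its mode. How MOSS maps these quantities onto the density and rotation of individual Gaussians is described in the paper and is not reproduced here.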

To address local occlusions in the single-view scenario, the Surface Deformation Detector (UID) module builds on KGAS. UID identifies significant surfaces and performs geometric reconstruction to compensate for the deformations on those surfaces.

The experimental results demonstrate that MOSS achieves state-of-the-art visual quality in 3D clothed human synthesis from monocular videos. Specifically, the authors report improvements of 33.94% and 16.75% in the LPIPS* metric compared to the Human NeRF and Gaussian Splatting approaches, respectively.
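As a reading aid, the percentages above are most naturally interpreted as relative reductions in LPIPS*, since lower LPIPS means the rendered image is perceptually closer to the ground truth. The sketch below shows that arithmetic under this assumption. The summary does not give the underlying LPIPS* scores, so the input values are placeholders chosen only to illustrate the formula, and `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Percent reduction of a lower-is-better metric relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline * 100.0


# Placeholder scores, NOT values from the paper.
baseline_score = 40.0
our_score = 30.0
improvement = relative_improvement(baseline_score, our_score)
```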

Critical Analysis

The paper presents a compelling solution to the challenging problem of realistic 3D clothed human reconstruction from a single camera view. The incorporation of kinematic information to guide the surface deformation is a key innovation that sets this work apart from previous methods.

However, the paper does not discuss the potential limitations of the MOSS framework. For example, although UID targets local occlusions, it is unclear how the framework performs under severe occlusion or with complex clothing. The computational cost of the KGAS and UID modules is also not addressed, which matters for real-time applications.

Further research could explore the robustness of MOSS to variations in input data, such as different camera viewpoints or clothing styles. Investigating the scalability of the framework to handle larger and more diverse datasets would also be valuable.

Overall, the MOSS framework represents an important step forward in the field of single-view clothed human reconstruction. By leveraging kinematic information, the researchers have developed a promising approach to enhance the realism of virtual human representations, which could have significant implications for a wide range of VR and animation applications.

Conclusion

This paper introduces the Motion-Based 3D Clothed Humans Synthesis (MOSS) framework, a novel approach to addressing the challenge of realistic clothing deformation in single-view clothed human reconstruction. By incorporating kinematic information through the KGAS and UID modules, MOSS is able to generate 3D clothed human models with significantly improved visual quality compared to previous state-of-the-art methods.

The key innovations of MOSS, such as the use of the matrix-Fisher distribution and the geometric reconstruction technique, demonstrate the potential of leveraging motion data to enhance the realism of virtual human representations. As virtual reality and animation continue to grow in importance, advancements like MOSS will be crucial in creating more immersive and realistic experiences.

While the paper leaves these limitations open, further research and development in this area could lead to even more realistic and compelling virtual human models, with implications for a variety of industries and applications.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds

Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, Kostas Daniilidis

We introduce 4D Motion Scaffolds (MoSca), a neural information processing system designed to reconstruct and synthesize novel views of dynamic scenes from monocular videos captured casually in the wild. To address such a challenging and ill-posed inverse problem, we leverage prior knowledge from foundational vision models, lift the video data to a novel Motion Scaffold (MoSca) representation, which compactly and smoothly encodes the underlying motions / deformations. The scene geometry and appearance are then disentangled from the deformation field, and are encoded by globally fusing the Gaussians anchored onto the MoSca and optimized via Gaussian Splatting. Additionally, camera poses can be seamlessly initialized and refined during the dynamic rendering process, without the need for other pose estimation tools. Experiments demonstrate state-of-the-art performance on dynamic rendering benchmarks.

5/28/2024

Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses

Inhee Lee, Byungjun Kim, Hanbyul Joo

In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation, enabling to conveniently and efficiently compose and render them together. In particular, we address the scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge encountered in the real world. To tackle this challenge, we introduce a novel approach to optimize the 3D-GS representation in a canonical space by fusing the sparse cues in the common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping the consistency with the observed 2D appearances. We demonstrate our method can reconstruct high-quality animatable 3D humans in various challenging examples, in the presence of occlusion, image crops, few-shot, and extremely sparse observations. After reconstruction, our method is capable of not only rendering the scene in any novel views at arbitrary time instances, but also editing the 3D scene by removing individual humans or applying different motions for each human. Through various experiments, we demonstrate the quality and efficiency of our methods over alternative existing approaches.

4/23/2024

Shape Conditioned Human Motion Generation with Diffusion Model

Kebing Xue, Hyewon Seo

Human motion synthesis is an important task in computer graphics and computer vision. While focusing on various conditioning signals such as text, action class, or audio to guide the generation process, most existing methods utilize skeleton-based pose representation, requiring additional skinning to produce renderable meshes. Given that human motion is a complex interplay of bones, joints, and muscles, considering solely the skeleton for generation may neglect their inherent interdependency, which can limit the variability and precision of the generated results. To address this issue, we propose a Shape-conditioned Motion Diffusion model (SMD), which enables the generation of motion sequences directly in mesh format, conditioned on a specified target mesh. In SMD, the input meshes are transformed into spectral coefficients using graph Laplacian, to efficiently represent meshes. Subsequently, we propose a Spectral-Temporal Autoencoder (STAE) to leverage cross-temporal dependencies within the spectral domain. Extensive experimental evaluations show that SMD not only produces vivid and realistic motions but also achieves competitive performance in text-to-motion and action-to-motion tasks when compared to state-of-the-art methods.

5/14/2024

SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering

Tao Hu, Fangzhou Hong, Ziwei Liu

Dynamic human rendering from video sequences has achieved remarkable progress by formulating the rendering as a mapping from static poses to human images. However, existing methods focus on the human appearance reconstruction of every single frame while the temporal motion relations are not fully explored. In this paper, we propose a new 4D motion modeling paradigm, SurMo, that jointly models the temporal dynamics and human appearances in a unified framework with three key designs: 1) Surface-based motion encoding that models 4D human motions with an efficient compact surface-based triplane. It encodes both spatial and temporal motion relations on the dense surface manifold of a statistical body template, which inherits body topology priors for generalizable novel view synthesis with sparse training observations. 2) Physical motion decoding that is designed to encourage physical motion learning by decoding the motion triplane features at timestep t to predict both spatial derivatives and temporal derivatives at the next timestep t+1 in the training stage. 3) 4D appearance decoding that renders the motion triplanes into images by an efficient volumetric surface-conditioned renderer that focuses on the rendering of body surfaces with motion learning conditioning. Extensive experiments validate the state-of-the-art performance of our new paradigm and illustrate the expressiveness of surface-based motion triplanes for rendering high-fidelity view-consistent humans with fast motions and even motion-dependent shadows. Our project page is at: https://taohuumd.github.io/projects/SurMo/

4/3/2024