$text{Di}^2text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation

2405.17016

Published 5/28/2024 by Weiquan Wang, Jun Xiao, Chunping Wang, Wei Liu, Zhao Wang, Long Chen

$$text{Di}^2text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation$

Abstract

Continuous diffusion models have demonstrated their effectiveness in addressing the inherent uncertainty and indeterminacy in monocular 3D human pose estimation (HPE). Despite their strengths, the need for large search spaces and the corresponding demand for substantial training data make these models prone to generating biomechanically unrealistic poses. This challenge is particularly noticeable in occlusion scenarios, where the complexity of inferring 3D structures from 2D images intensifies. In response to these limitations, we introduce the Discrete Diffusion Pose ($text{Di}^2text{Pose}$), a novel framework designed for occluded 3D HPE that capitalizes on the benefits of a discrete diffusion model. Specifically, $text{Di}^2text{Pose}$ employs a two-stage process: it first converts 3D poses into a discrete representation through a emph{pose quantization step}, which is subsequently modeled in latent space through a emph{discrete diffusion process}. This methodological innovation restrictively confines the search space towards physically viable configurations and enhances the model's capability to comprehend how occlusions affect human pose within the latent space. Extensive evaluations conducted on various benchmarks (e.g., Human3.6M, 3DPW, and 3DPW-Occ) have demonstrated its effectiveness.

Create account to get full access

Overview

• This paper proposes a novel discrete diffusion model, called "Di"²⁢"Pose", for the task of 3D human pose estimation in the presence of occlusions.

• The model leverages a powerful discrete diffusion process to generate plausible 3D human poses from partial and occluded observations, addressing the challenges of ambiguity and uncertainty inherent in this problem.

• The paper demonstrates the effectiveness of the "Di"²⁢"Pose" model on several benchmark datasets, outperforming state-of-the-art methods for 3D human pose estimation under occlusion.

Plain English Explanation

• Estimating the 3D position of a person's body parts (like the arms, legs, and torso) from camera images is a challenging task, especially when parts of the person are hidden or obscured (occluded).

• The "Di"²⁢"Pose" model developed in this paper uses a novel approach called "discrete diffusion" to generate realistic 3D human poses from partial or occluded observations.

• Discrete diffusion is a technique that slowly adds noise to an image or 3D model, then learns to reverse the process to reconstruct the original. This allows the model to handle the ambiguity and uncertainty inherent in estimating 3D poses from partial data.

• The authors show that "Di"²⁢"Pose" outperforms other state-of-the-art methods for 3D human pose estimation in the presence of occlusions, making it a promising approach for real-world applications where people are often partially obscured.

Technical Explanation

• The "Di"²⁢"Pose" model uses a discrete diffusion process to generate 3D human pose estimates from partial or occluded observations.

• The model consists of an encoder network that takes in the partial 3D pose data and produces a latent representation, and a diffusion-based decoder network that iteratively refines this latent representation to recover the full 3D pose.

• The diffusion process gradually adds noise to the latent representation, then the decoder learns to reverse this process to reconstruct the original 3D pose. This allows the model to handle ambiguity and uncertainty in the input data.

• The authors evaluate "Di"²⁢"Pose" on several benchmark datasets for 3D human pose estimation under occlusion, and show that it outperforms previous state-of-the-art methods in terms of accuracy and robustness.

• Key insights from the paper include the effectiveness of the discrete diffusion approach for handling occluded 3D data, and the importance of modeling the inherent ambiguity in this task to produce high-quality pose estimates.

Critical Analysis

• The paper provides a thorough evaluation of the "Di"²⁢"Pose" model on several benchmark datasets, demonstrating its superior performance over existing methods for 3D human pose estimation under occlusion.

• However, the paper does not discuss any potential limitations or caveats of the proposed approach. For example, it would be useful to understand the computational complexity of the model, or how it might perform on more diverse or challenging datasets.

• Additionally, the paper could have explored the potential real-world applications and implications of this research, such as how it could be used in human-computer interaction, robotics, or virtual/augmented reality systems.

• Overall, the paper presents a novel and promising approach to a challenging problem, but additional discussion of the method's limitations and broader impact would strengthen the contribution.

Conclusion

• The "Di"²⁢"Pose" model presented in this paper offers a novel and effective solution for 3D human pose estimation in the presence of occlusions, leveraging a powerful discrete diffusion process to handle the inherent ambiguity and uncertainty in this task.

• The authors demonstrate the superiority of their approach over state-of-the-art methods on several benchmark datasets, suggesting that "Di"²⁢"Pose" could be a valuable tool for a wide range of applications involving 3D human pose estimation in real-world, occluded environments.

• While the paper could have provided more discussion of the method's limitations and potential real-world implications, it represents an important contribution to the field of 3D computer vision and human pose analysis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🔄

FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models

Jinglin Xu, Yijie Guo, Yuxin Peng

The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advancements in deep learning-based methods, they mostly ignore the capability of coupling accessible texts and naturally feasible knowledge of humans, missing out on valuable implicit supervision to guide the 3D HPE task. Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting fine-grained guidance hidden in different body parts. To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named textbf{FinePOSE}. It consists of three core blocks enhancing the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts via coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp Stylization (PTS) block integrates learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios. Code is available at https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.

5/9/2024

cs.CV

🔎

Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)

Tsu-Ching Hsiao, Hao-Wei Chen, Hsuan-Kung Yang, Chun-Yi Lee

Addressing pose ambiguity in 6D object pose estimation from single RGB images presents a significant challenge, particularly due to object symmetries or occlusions. In response, we introduce a novel score-based diffusion method applied to the $SE(3)$ group, marking the first application of diffusion models to $SE(3)$ within the image domain, specifically tailored for pose estimation tasks. Extensive evaluations demonstrate the method's efficacy in handling pose ambiguity, mitigating perspective-induced ambiguity, and showcasing the robustness of our surrogate Stein score formulation on $SE(3)$. This formulation not only improves the convergence of denoising process but also enhances computational efficiency. Thus, we pioneer a promising strategy for 6D object pose estimation.

4/9/2024

cs.CV

DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

Yixuan Zhu, Ao Li, Yansong Tang, Wenliang Zhao, Jie Zhou, Jiwen Lu

The recovery of occluded human meshes presents challenges for current methods due to the difficulty in extracting effective image features under severe occlusion. In this paper, we introduce DPMesh, an innovative framework for occluded human mesh recovery that capitalizes on the profound diffusion prior about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. Unlike previous methods reliant on conventional backbones for vanilla feature extraction, DPMesh seamlessly integrates the pre-trained denoising U-Net with potent knowledge as its image backbone and performs a single-step inference to provide occlusion-aware information. To enhance the perception capability for occluded poses, DPMesh incorporates well-designed guidance via condition injection, which produces effective controls from 2D observations for the denoising U-Net. Furthermore, we explore a dedicated noisy key-point reasoning approach to mitigate disturbances arising from occlusion and crowded scenarios. This strategy fully unleashes the perceptual capability of the diffusion prior, thereby enhancing accuracy. Extensive experiments affirm the efficacy of our framework, as we outperform state-of-the-art methods on both occlusion-specific and standard datasets. The persuasive results underscore its ability to achieve precise and robust 3D human mesh recovery, particularly in challenging scenarios involving occlusion and crowded scenes.

4/3/2024

cs.CV

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

Zixuan Chen, Ruijie Su, Jiahao Zhu, Lingxiao Yang, Jian-Huang Lai, Xiaohua Xie

Text-to-3D generation aims to create 3D assets from text-to-image diffusion models. However, existing methods face an inherent bottleneck in generation quality because the widely-used objectives such as Score Distillation Sampling (SDS) inappropriately omit U-Net jacobians for swift generation, leading to significant bias compared to the true gradient obtained by full denoising sampling. This bias brings inconsistent updating direction, resulting in implausible 3D generation e.g., color deviation, Janus problem, and semantically inconsistent details). In this work, we propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel yet efficient objective for diffusion-based 3D generation tasks. Specifically, PCDS builds the pose-dependent consistency function within diffusion trajectories, allowing to approximate true gradients through minimal sampling steps (1-3). Compared to SDS, PCDS can acquire a more accurate updating direction with the same sampling time (1 sampling step), while enabling few-step (2-3) sampling to trade compute for higher generation quality. For efficient generation, we propose a coarse-to-fine optimization strategy, which first utilizes 1-step PCDS to create the basic structure of 3D objects, and then gradually increases PCDS steps to generate fine-grained details. Extensive experiments demonstrate that our approach outperforms the state-of-the-art in generation quality and training efficiency, conspicuously alleviating the implausible 3D generation issues caused by the deviated updating direction. Moreover, it can be simply applied to many 3D generative applications to yield impressive 3D assets, please see our project page: https://narcissusex.github.io/VividDreamer.

6/24/2024

cs.CV