Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser

Qingyuan Cai, Xuecai Hu, Saihui Hou, Li Yao, Yongzhen Huang

2024-03-07
3D Human Pose Estimation, Monocular 3D Human Pose Estimation, Disentanglement, Multi-Hypotheses 3D Human Pose Estimation, Pose Estimation

Abstract

Recently, diffusion-based methods for monocular 3D human pose estimation have achieved state-of-the-art (SOTA) performance by directly regressing the 3D joint coordinates from the 2D pose sequence. Although some methods decompose the task into bone length and bone direction prediction based on the human anatomical skeleton to explicitly incorporate more human body prior constraints, the performance of these methods is significantly lower than that of the SOTA diffusion-based methods. This can be attributed to the tree structure of the human skeleton. Direct application of the disentangled method could amplify the accumulation of hierarchical errors, propagating through each hierarchy. Meanwhile, the hierarchical information has not been fully explored by the previous methods. To address these problems, a Disentangled Diffusion-based 3D Human Pose Estimation method with Hierarchical Spatial and Temporal Denoiser is proposed, termed DDHPose. In our approach: (1) We disentangle the 3D pose and diffuse the bone length and bone direction during the forward process of the diffusion model to effectively model the human pose prior. A disentanglement loss is proposed to supervise diffusion model learning. (2) For the reverse process, we propose Hierarchical Spatial and Temporal Denoiser (HSTDenoiser) to improve the hierarchical modeling of each joint. Our HSTDenoiser comprises two components: the Hierarchical-Related Spatial Transformer (HRST) and the Hierarchical-Related Temporal Transformer (HRTT). HRST exploits joint spatial information and the influence of the parent joint on each joint for spatial modeling, while HRTT utilizes information from both the joint and its hierarchical adjacent joints to explore the hierarchical temporal correlations among joints.
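
The disentanglement described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical 5-joint skeleton (`PARENTS`) and hypothetical helper names `disentangle` and `recompose`; it is not the DDHPose implementation, and it omits the diffusion process and the denoiser entirely. It only shows how a pose splits into bone lengths and unit bone directions, and why an error on a bone near the root shifts every joint below it in the tree.

```python
import numpy as np

# Hypothetical 5-joint skeleton (an assumption for illustration only).
# PARENTS[j] is the parent of joint j; -1 marks the root.
# Every parent index is smaller than its child's index.
PARENTS = np.array([-1, 0, 1, 1, 3])


def disentangle(pose, parents=PARENTS):
    """Split a (J, 3) pose into bone lengths (J-1,) and unit directions (J-1, 3)."""
    child = np.arange(1, len(parents))
    bones = pose[child] - pose[parents[child]]
    lengths = np.linalg.norm(bones, axis=-1)
    # Guard against zero-length bones before normalising.
    directions = bones / np.maximum(lengths[:, None], 1e-8)
    return lengths, directions


def recompose(root, lengths, directions, parents=PARENTS):
    """Rebuild the (J, 3) pose by walking the tree outward from the root."""
    pose = np.zeros((len(parents), 3))
    pose[0] = root
    for j in range(1, len(parents)):
        # Bone j-1 connects joint j to its parent.
        pose[j] = pose[parents[j]] + lengths[j - 1] * directions[j - 1]
    return pose


rng = np.random.default_rng(0)
pose = rng.normal(size=(5, 3))
lengths, directions = disentangle(pose)

# Perturb only the bone attached to the root and measure how far each joint moves.
perturbed = lengths.copy()
perturbed[0] += 0.1
shift = np.linalg.norm(recompose(pose[0], perturbed, directions) - pose, axis=-1)
```

In this toy skeleton every non-root joint descends from joint 1, so `shift` is zero at the root and 0.1 at all four other joints: a single bone error is inherited by the whole subtree, which is the hierarchical error accumulation the abstract refers to.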

Results

The same six results are listed under four task labels: 3D Human Pose Estimation, Pose Estimation, 3D, and 1 Image, 2*2 Stitchi.

| Dataset | Model | Metric | Value |
|---|---|---|---|
| Human3.6M | DDHPose | Average MPJPE (mm) | 39.7 |
| Human3.6M | DDHPose | Frames Needed | 243 |
| Human3.6M | DDHPose (H=20, W=10, J-Best) | Average MPJPE (mm) | 33.62 |
| Human3.6M | DDHPose (H=20, W=10, J-Best) | Average PMPJPE (mm) | 26.48 |
| Human3.6M | DDHPose (H=20, W=10, P-Best) | Average MPJPE (mm) | 39 |
| Human3.6M | DDHPose (H=20, W=10, P-Best) | Average PMPJPE (mm) | 31.2 |
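
The page does not define the two metrics. The sketch below assumes their common definitions: MPJPE as the mean Euclidean distance between predicted and ground-truth joints, and PMPJPE as the same error after a similarity (Procrustes) alignment of the prediction onto the ground truth. The function names `mpjpe` and `p_mpjpe` are illustrative, and the benchmark's own evaluation code may differ in details such as root alignment or averaging over frames.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint Euclidean distance between two (J, 3) poses, in input units."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def p_mpjpe(pred, gt):
    """MPJPE after aligning pred to gt with the best scale, rotation and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # Rotation from the SVD of the cross-covariance matrix (Kabsch / Umeyama).
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    flip = np.array([1.0, 1.0, np.sign(np.linalg.det(u @ vt))])
    rot = u @ np.diag(flip) @ vt

    # Least-squares optimal scale for the chosen rotation.
    scale = (s * flip).sum() / (p ** 2).sum()

    aligned = scale * p @ rot + mu_g
    return mpjpe(aligned, gt)


rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))

# A prediction that differs from gt only by scale, rotation about z, and translation.
theta = np.pi / 6
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
pred = 2.0 * gt @ rot_z + np.array([0.5, -1.0, 2.0])

raw_error = mpjpe(pred, gt)        # large: the poses are not aligned
aligned_error = p_mpjpe(pred, gt)  # close to zero: only a similarity transform differs
```

Because PMPJPE removes global scale, rotation and translation before measuring, it isolates errors in the articulated pose itself, which is consistent with the PMPJPE values in the table being lower than the corresponding MPJPE values.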

Related Papers

CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models (2025-07-18)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)