Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation

Xiaolong Shen, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, Yi Yang

2023-03-26 | CVPR 2023 | 3D Human Pose Estimation | 3D human pose and shape estimation

Abstract

Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness. Although these two metrics are responsible for different ranges of temporal consistency, existing state-of-the-art methods treat them as a unified problem and design their networks around a single type of modeling structure (e.g., an RNN or an attention-based block). However, a single kind of modeling structure struggles to balance the learning of short-term and long-term temporal correlations and may bias the network toward one of them, leading to undesirable predictions such as global location shift, temporal inconsistency, and insufficient local details. To solve these problems, we propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT). First, a global transformer is introduced with a Masked Pose and Shape Estimation strategy for long-term modeling. The strategy encourages the global transformer to learn more inter-frame correlations by randomly masking the features of several frames. Second, a local transformer exploits local details on the human mesh and interacts with the global transformer through cross-attention. Moreover, a Hierarchical Spatial Correlation Regressor is introduced to refine intra-frame estimations using the decoupled global-local representation and implicit kinematic constraints. GLoT surpasses previous state-of-the-art methods while using the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at https://github.com/sxl142/GLoT.
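
The abstract describes the masking strategy only as "randomly masking the features of several frames". The idea can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the function name `mask_frame_features`, the `mask_ratio` parameter, and the use of a zero vector as the mask token are all assumptions made for illustration.

```python
import random


def mask_frame_features(features, mask_ratio=0.5, mask_token=None, rng=None):
    """Replace the features of randomly chosen frames with a mask token.

    features: list of per-frame feature vectors (each a list of floats).
    Returns (masked_features, masked_indices); the input is not modified.
    """
    rng = rng or random.Random()
    num_frames = len(features)
    dim = len(features[0]) if num_frames else 0
    if mask_token is None:
        # Assumption: a zero vector stands in for whatever token a real model uses.
        mask_token = [0.0] * dim
    num_masked = int(num_frames * mask_ratio)
    # Sample distinct frame indices without replacement.
    masked_indices = sorted(rng.sample(range(num_frames), num_masked))
    masked_set = set(masked_indices)
    masked = [
        list(mask_token) if t in masked_set else list(frame)
        for t, frame in enumerate(features)
    ]
    return masked, masked_indices
```

A model trained on such input has to infer the masked frames from the remaining ones, which is the stated motivation for the strategy: it pushes the global transformer toward inter-frame correlations.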

Results

Task                      Dataset       Metric              Value  Model
3D Human Pose Estimation  MPI-INF-3DHP  Acceleration Error  7.9    GLoT
3D Human Pose Estimation  MPI-INF-3DHP  MPJPE               93.9   GLoT
3D Human Pose Estimation  MPI-INF-3DHP  PA-MPJPE            61.5   GLoT
3D Human Pose Estimation  3DPW          Acceleration Error  6.6    GLoT
3D Human Pose Estimation  3DPW          MPJPE               80.7   GLoT
3D Human Pose Estimation  3DPW          MPVPE               96.3   GLoT
3D Human Pose Estimation  3DPW          PA-MPJPE            50.6   GLoT
Pose Estimation           MPI-INF-3DHP  Acceleration Error  7.9    GLoT
Pose Estimation           MPI-INF-3DHP  MPJPE               93.9   GLoT
Pose Estimation           MPI-INF-3DHP  PA-MPJPE            61.5   GLoT
Pose Estimation           3DPW          Acceleration Error  6.6    GLoT
Pose Estimation           3DPW          MPJPE               80.7   GLoT
Pose Estimation           3DPW          MPVPE               96.3   GLoT
Pose Estimation           3DPW          PA-MPJPE            50.6   GLoT
3D                        MPI-INF-3DHP  Acceleration Error  7.9    GLoT
3D                        MPI-INF-3DHP  MPJPE               93.9   GLoT
3D                        MPI-INF-3DHP  PA-MPJPE            61.5   GLoT
3D                        3DPW          Acceleration Error  6.6    GLoT
3D                        3DPW          MPJPE               80.7   GLoT
3D                        3DPW          MPVPE               96.3   GLoT
3D                        3DPW          PA-MPJPE            50.6   GLoT
1 Image, 2*2 Stitchi      MPI-INF-3DHP  Acceleration Error  7.9    GLoT
1 Image, 2*2 Stitchi      MPI-INF-3DHP  MPJPE               93.9   GLoT
1 Image, 2*2 Stitchi      MPI-INF-3DHP  PA-MPJPE            61.5   GLoT
1 Image, 2*2 Stitchi      3DPW          Acceleration Error  6.6    GLoT
1 Image, 2*2 Stitchi      3DPW          MPJPE               80.7   GLoT
1 Image, 2*2 Stitchi      3DPW          MPVPE               96.3   GLoT
1 Image, 2*2 Stitchi      3DPW          PA-MPJPE            50.6   GLoT
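
For readers unfamiliar with the metrics in this table, the sketch below shows how two of them are commonly computed. MPJPE is the mean Euclidean distance between predicted and ground-truth joints. Acceleration error is the mean distance between predicted and ground-truth joint accelerations, where acceleration is taken as a second-order finite difference over frames. The code is illustrative and is not the evaluation script behind the reported numbers. The function names and the input layout (a list of frames, each a list of `(x, y, z)` joint tuples) are assumptions. It is unit-agnostic and omits PA-MPJPE, which additionally requires a Procrustes alignment step.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error over all frames and joints."""
    errors = [
        math.dist(p, g)
        for pred_frame, gt_frame in zip(pred, gt)
        for p, g in zip(pred_frame, gt_frame)
    ]
    return sum(errors) / len(errors)


def _acceleration(seq):
    """Per-joint second-order finite difference: x[t] - 2*x[t+1] + x[t+2]."""
    return [
        [
            tuple(a - 2 * b + c for a, b, c in zip(j0, j1, j2))
            for j0, j1, j2 in zip(f0, f1, f2)
        ]
        for f0, f1, f2 in zip(seq, seq[1:], seq[2:])
    ]


def acceleration_error(pred, gt):
    """Mean distance between predicted and ground-truth joint accelerations.

    Requires at least three frames.
    """
    return mpjpe(_acceleration(pred), _acceleration(gt))
```

A constant offset between prediction and ground truth raises MPJPE but leaves the acceleration error at zero, which is why the two metrics are reported separately: one captures per-frame accuracy, the other temporal smoothness.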

Related Papers

Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images (2025-06-24)
ExtPose: Robust and Coherent Pose Estimation by Extending ViTs (2025-06-18)
PoseGRAF: Geometric-Reinforced Adaptive Fusion for Monocular 3D Human Pose Estimation (2025-06-17)
Learning Pyramid-structured Long-range Dependencies for 3D Human Pose Estimation (2025-06-03)
UPTor: Unified 3D Human Pose Dynamics and Trajectory Prediction for Human-Robot Interaction (2025-05-20)
PoseBench3D: A Cross-Dataset Analysis Framework for 3D Human Pose Estimation (2025-05-16)
HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation (2025-05-07)
Continuous Normalizing Flows for Uncertainty-Aware Human Pose Estimation (2025-05-04)