


P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation

Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Wen Gao

2022-03-15 | Denoising | 3D Human Pose Estimation | Monocular 3D Human Pose Estimation | Pose Estimation

Abstract

This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task. To reduce the difficulty of capturing spatial and temporal information, we divide this task into two stages: pre-training (Stage I) and fine-tuning (Stage II). In Stage I, a self-supervised pre-training sub-task, termed masked pose modeling, is proposed. The human joints in the input sequence are randomly masked in both the spatial and temporal domains. A general form of denoising auto-encoder is used to recover the original 2D poses, which enables the encoder to capture spatial and temporal dependencies. In Stage II, the pre-trained encoder is loaded into the STMO model and fine-tuned. The encoder is followed by a many-to-one frame aggregator that predicts the 3D pose in the current frame. In particular, an MLP block is used as the spatial feature extractor in STMO, which yields better performance than other methods. In addition, a temporal downsampling strategy is proposed to reduce data redundancy. Extensive experiments on two benchmarks show that our method outperforms state-of-the-art methods with fewer parameters and less computational overhead. For example, our P-STMO model achieves 42.1 mm MPJPE on the Human3.6M dataset when using 2D poses from CPN as inputs. Meanwhile, it brings a 1.5-7.1 times speedup over state-of-the-art methods. Code is available at https://github.com/paTRICK-swk/P-STMO.
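The masking step of masked pose modeling can be illustrated with a small sketch. This is not the authors' implementation: the function name `mask_pose_sequence`, the `PLACEHOLDER` value written into hidden joints, and the mask counts are all assumptions chosen for illustration. It only shows the idea of hiding whole frames (temporal masking) and a random subset of joints in the remaining frames (spatial masking), so that a denoising auto-encoder could be trained to recover the original 2D poses.

```python
import random

# Hypothetical value written into hidden joints; the abstract does not say
# what the model actually substitutes for masked inputs.
PLACEHOLDER = (0.0, 0.0)


def mask_pose_sequence(poses, num_masked_frames, num_masked_joints, seed=None):
    """Randomly mask a 2D pose sequence in the temporal and spatial domains.

    poses: list of frames, each a list of (x, y) joint coordinates.
    Returns (masked_poses, masked_frame_indices, joint_mask), where
    joint_mask[t][j] is True if joint j of frame t was hidden.
    """
    rng = random.Random(seed)
    num_frames = len(poses)
    num_joints = len(poses[0])

    # Temporal masking: hide every joint of the selected frames.
    masked_frames = set(rng.sample(range(num_frames), num_masked_frames))

    masked_poses, joint_mask = [], []
    for t, frame in enumerate(poses):
        if t in masked_frames:
            hidden = set(range(num_joints))
        else:
            # Spatial masking: hide a random subset of joints in this frame.
            hidden = set(rng.sample(range(num_joints), num_masked_joints))
        masked_poses.append(
            [PLACEHOLDER if j in hidden else frame[j] for j in range(num_joints)]
        )
        joint_mask.append([j in hidden for j in range(num_joints)])
    return masked_poses, sorted(masked_frames), joint_mask


if __name__ == "__main__":
    # Toy input: 9 frames of 17 joints each (sizes chosen for illustration).
    toy = [[(float(t), float(j)) for j in range(17)] for t in range(9)]
    masked, frames, mask = mask_pose_sequence(toy, 3, 2, seed=0)
    print("temporally masked frames:", frames)
```

The returned `joint_mask` marks which positions a reconstruction loss would compare against the original poses; how the loss is defined and which mask ratios work well are not stated in the abstract.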

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| 3D Human Pose Estimation | MPI-INF-3DHP | AUC | 75.8 | P-STMO (N=81) |
| 3D Human Pose Estimation | MPI-INF-3DHP | MPJPE | 32.2 | P-STMO (N=81) |
| 3D Human Pose Estimation | MPI-INF-3DHP | PCK | 97.9 | P-STMO (N=81) |
| 3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 42.1 | P-STMO (N=243) |
| 3D Human Pose Estimation | Human3.6M | PA-MPJPE | 34.4 | P-STMO (N=243) |
| 3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 44.1 | P-STMO-S (N=81) |
| 3D Human Pose Estimation | Human3.6M | Frames Needed | 243 | P-STMO (N=243) |
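For readers unfamiliar with the metrics in the results above, here is a minimal sketch of how MPJPE and PCK are commonly computed for a single pose. The function names, the toy coordinates, and the inclusive threshold comparison are assumptions for illustration; benchmark protocols differ in details such as root alignment and the PCK threshold, and none of those details are specified on this page.

```python
import math


def joint_errors(pred, gt):
    """Euclidean distance between each predicted and ground-truth 3D joint."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def mpjpe(pred, gt):
    """Mean per-joint position error, in the same unit as the inputs."""
    errors = joint_errors(pred, gt)
    return sum(errors) / len(errors)


def pck(pred, gt, threshold):
    """Percentage of joints whose error is within `threshold` (assumed inclusive)."""
    errors = joint_errors(pred, gt)
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)


if __name__ == "__main__":
    # Toy two-joint pose in millimetres (values invented for illustration).
    gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
    pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]
    print(mpjpe(pred, gt))      # 8.5  (errors are 5 mm and 12 mm)
    print(pck(pred, gt, 10.0))  # 50.0 (one of two joints within 10 mm)
```

Lower MPJPE is better, while higher PCK is better. PA-MPJPE additionally aligns the prediction to the ground truth with a rigid (Procrustes) transform before measuring the error, which this sketch does not implement.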
