Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, Junsong Yuan

2022-03-02 | CVPR 2022
Tasks: 3D Human Pose Estimation, Monocular 3D Human Pose Estimation, Pose Estimation, Classification

Abstract

Recent transformer-based solutions estimate 3D human pose from 2D keypoint sequences by considering body joints among all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the inter-frame correspondence of each joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are applied alternately to obtain better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to all frames of the input video, thereby improving the coherence between the input and output sequences. Extensive experiments are conducted on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva). The results show that our model outperforms the state-of-the-art approach by 10.9% in P-MPJPE and 7.6% in MPJPE. The code is available at https://github.com/JinluZhang1126/MixSTE.
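
The alternating design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the attention here has no learned projections, MLP, or normalization, the block order (spatial first, then temporal) and the depth are assumptions, and the names self_attention, spatial_block, temporal_block, and mixed_encoder are hypothetical. It only shows how the same [T][J][C] sequence is grouped per frame for the spatial block and per joint for the temporal block, and that the output keeps one vector per frame and joint (sequence in, sequence out).

```python
import math


def self_attention(tokens):
    """Unparameterized scaled dot-product self-attention over a list of vectors.

    Stand-in for a full transformer block (no learned weights, MLP, or norm).
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[c] for w, v in zip(weights, tokens)) / total
            for c in range(dim)
        ])
    return out


def spatial_block(x):
    """Attend across the J joints within each frame; x is indexed [T][J][C]."""
    return [self_attention(frame) for frame in x]


def temporal_block(x):
    """Attend across the T frames separately for each joint; x is [T][J][C]."""
    num_frames, num_joints = len(x), len(x[0])
    out = [[None] * num_joints for _ in range(num_frames)]
    for j in range(num_joints):
        trajectory = [x[t][j] for t in range(num_frames)]  # one joint over time
        for t, vec in enumerate(self_attention(trajectory)):
            out[t][j] = vec
    return out


def mixed_encoder(x, depth=2):
    """Alternate spatial and temporal blocks `depth` times (assumed order/depth)."""
    for _ in range(depth):
        x = temporal_block(spatial_block(x))
    return x


# Toy input: T=3 frames, J=4 joints, C=2 features (e.g. 2D keypoints).
frames = [[[float(t + j), float(t - j)] for j in range(4)] for t in range(3)]
encoded = mixed_encoder(frames)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))
```

The output has the same frame and joint layout as the input, which is what allows a prediction for every input frame rather than only the central one.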

Results

Task | Dataset | Metric | Value | Model
3D Human Pose Estimation | HumanEva-I | Mean Reconstruction Error (mm) | 16.1 | MixSTE (T=43, FT)
3D Human Pose Estimation | MPI-INF-3DHP | AUC | 66.5 | MixSTE (T=27)
3D Human Pose Estimation | MPI-INF-3DHP | MPJPE | 54.9 | MixSTE (T=27)
3D Human Pose Estimation | MPI-INF-3DHP | PCK | 94.4 | MixSTE (T=27)
3D Human Pose Estimation | MPI-INF-3DHP | AUC | 63.8 | MixSTE (T=1)
3D Human Pose Estimation | MPI-INF-3DHP | MPJPE | 57.9 | MixSTE (T=1)
3D Human Pose Estimation | MPI-INF-3DHP | PCK | 94.2 | MixSTE (T=1)
3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 39.8 | MixSTE (HRNet, T=243)
3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 40.9 | MixSTE (CPN, T=243)
3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 42.4 | MixSTE (CPN, T=81)
3D Human Pose Estimation | Human3.6M | Frames Needed | 243 | MixSTE (HRNet, T=243)
Classification | Full-body Parkinson’s disease dataset | F1-score (weighted) | 0.41 | MixSTE

The pose-estimation rows are also listed under the task labels "Pose Estimation", "3D", and "1 Image, 2*2 Stitchi".
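
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of two of them under their usual definitions: MPJPE as the mean Euclidean distance between predicted and ground-truth joints, and PCK as the percentage of joints whose error falls within a threshold. The 150 mm default threshold and the toy pose are assumptions for illustration, not values taken from this page. P-MPJPE (often reported as mean reconstruction error) additionally applies a rigid Procrustes alignment before measuring the error and is not implemented here.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints."""
    dists = [math.dist(p, g) for p, g in zip(pred, target)]
    return sum(dists) / len(dists)


def pck(pred, target, threshold=150.0):
    """Percentage of joints whose error is within `threshold` (assumed mm)."""
    hits = sum(1 for p, g in zip(pred, target) if math.dist(p, g) <= threshold)
    return 100.0 * hits / len(pred)


# Hypothetical 3-joint pose in millimetres (not data from the paper).
target = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 300.0)]

print(mpjpe(pred, target))  # per-joint errors are 5, 0, and 300 mm
print(pck(pred, target))    # two of the three joints fall within 150 mm
```

Lower is better for MPJPE, while higher is better for PCK and AUC.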
