Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation

Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, Wenming Yang

Published: 2021-03-26 · Tasks: 3D Human Pose Estimation · Monocular 3D Human Pose Estimation · Pose Estimation
Paper · PDF · Code (official)

Abstract

Despite the great progress in 3D human pose estimation from videos, it is still an open problem to take full advantage of a redundant 2D pose sequence to learn representative representations for generating one 3D pose. To this end, we propose an improved Transformer-based architecture, called Strided Transformer, which simply and effectively lifts a long sequence of 2D joint locations to a single 3D pose. Specifically, a Vanilla Transformer Encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence, fully-connected layers in the feed-forward network of VTE are replaced with strided convolutions to progressively shrink the sequence length and aggregate information from local contexts. The modified VTE is termed as Strided Transformer Encoder (STE), which is built upon the outputs of VTE. STE not only effectively aggregates long-range information to a single-vector representation in a hierarchical global and local fashion, but also significantly reduces the computation cost. Furthermore, a full-to-single supervision scheme is designed at both full sequence and single target frame scales applied to the outputs of VTE and STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single target frame supervision and hence helps produce smoother and more accurate 3D poses. The proposed Strided Transformer is evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-I, and achieves state-of-the-art results with fewer parameters. Code and models are available at \url{https://github.com/Vegetebird/StridedTransformer-Pose3D}.
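The abstract's key mechanism is replacing the fully-connected layers in the Transformer feed-forward network with strided convolutions, so that each STE layer shrinks the sequence length while aggregating local context. Below is a minimal toy sketch of that length-reduction idea in plain Python, not the paper's implementation: the real model applies learned strided 1-D convolutions over multi-channel feature vectors inside the Transformer blocks, whereas this sketch uses a fixed averaging kernel over scalars purely to show how a 27-frame sequence collapses to a single representation.

```python
def strided_conv1d(seq, kernel, stride):
    """Toy 1-D convolution over a sequence of scalars.

    seq: list of floats (length T); kernel: list of weights;
    stride: step between windows. Returns the downsampled sequence.
    """
    k = len(kernel)
    out = []
    for start in range(0, len(seq) - k + 1, stride):
        window = seq[start:start + k]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out

# Progressively shrink a T=27 sequence to length 1 with three
# stride-3 layers (kernel size 3), mirroring how strided convolutions
# in the STE collapse a long 2D pose sequence toward one target frame.
seq = [float(t) for t in range(27)]   # stand-in for per-frame features
kernel = [1 / 3, 1 / 3, 1 / 3]        # simple averaging kernel (illustrative)
for _ in range(3):
    seq = strided_conv1d(seq, kernel, stride=3)
    print(len(seq))                   # prints 9, then 3, then 1
```

Each pass divides the sequence length by the stride, which is why a small stack of strided layers is enough to reduce hundreds of frames to a single-vector representation at low computational cost.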

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
3D Human Pose Estimation | HumanEva-I | Mean Reconstruction Error (mm) | 12.2 | StridedTransformer (T=27, GT)
3D Human Pose Estimation | HumanEva-I | Mean Reconstruction Error (mm) | 18.9 | StridedTransformer (T=27, MRCNN)
3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 43.7 | StridedTransformer (T=351)
3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 44.0 | StridedTransformer (T=243)
3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 45.4 | StridedTransformer (T=81)
3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 46.9 | StridedTransformer (T=27)
Pose Estimation | HumanEva-I | Mean Reconstruction Error (mm) | 12.2 | StridedTransformer (T=27, GT)
Pose Estimation | HumanEva-I | Mean Reconstruction Error (mm) | 18.9 | StridedTransformer (T=27, MRCNN)
Pose Estimation | Human3.6M | Average MPJPE (mm) | 43.7 | StridedTransformer (T=351)
Pose Estimation | Human3.6M | Average MPJPE (mm) | 44.0 | StridedTransformer (T=243)
Pose Estimation | Human3.6M | Average MPJPE (mm) | 45.4 | StridedTransformer (T=81)
Pose Estimation | Human3.6M | Average MPJPE (mm) | 46.9 | StridedTransformer (T=27)
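The Human3.6M rows report Average MPJPE, the standard per-joint error for this task. As a point of reference, here is a minimal sketch of how MPJPE is conventionally computed: the mean Euclidean distance between predicted and ground-truth 3D joint positions, averaged over all joints and frames. This is a generic illustration, not code from the paper's repository; note also that the HumanEva-I metric (Mean Reconstruction Error) is typically MPJPE after a rigid alignment of the prediction to the ground truth.

```python
import math

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance (in the
    input units, e.g. mm) between predicted and ground-truth 3D joints,
    over all frames. pred, gt: lists of frames, each a list of (x, y, z).
    """
    total, count = 0.0, 0
    for p_frame, g_frame in zip(pred, gt):
        for p_joint, g_joint in zip(p_frame, g_frame):
            total += math.dist(p_joint, g_joint)
            count += 1
    return total / count

# One frame, two joints: per-joint errors of 3 mm and 4 mm -> MPJPE 3.5 mm
pred = [[(0.0, 0.0, 3.0), (0.0, 4.0, 0.0)]]
gt = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
print(mpjpe(pred, gt))  # 3.5
```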

Related Papers

$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)