Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation

Zhongwei Qiu, Qiansheng Yang, Jian Wang, Dongmei Fu

2022-08-06 · 3D Human Pose Estimation · Pose Estimation · 3D Pose Estimation · 2D Pose Estimation · 3D Multi-Person Pose Estimation
Paper · PDF

Abstract

Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos. Recent transformer-based approaches focus on capturing spatiotemporal information from sequences of 2D poses, but they cannot model contextual depth features effectively because the visual depth cues are lost during 2D pose estimation. In this paper, we simplify the paradigm into an end-to-end framework, the Instance-guided Video Transformer (IVT), which learns spatiotemporal contextual depth information from visual features and predicts 3D poses directly from video frames. In particular, we first formulate video frames as a series of instance-guided tokens, where each token is responsible for predicting the 3D pose of one human instance. These tokens carry body-structure information because they are extracted under the guidance of joint offsets from the human center to the corresponding body joints. The tokens are then fed into IVT to learn spatiotemporal contextual depth. In addition, we propose a cross-scale instance-guided attention mechanism to handle the varying scales of multiple persons. Finally, the 3D pose of each person is decoded from its instance-guided token by coordinate regression. Experiments on three widely used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performance.
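As a rough illustration of the token-extraction and coordinate-regression steps described above (this is not the authors' code: the function names, the nearest-pixel sampling, and the mean-pooling over joints are all assumptions for the sketch):

```python
import numpy as np

def instance_guided_token(feature_map, center, joint_offsets):
    """Sample features at joint locations given as offsets from the human
    center, and pool them into one instance-guided token.

    feature_map:   (H, W, C) visual feature map
    center:        (2,) integer (row, col) human-center location
    joint_offsets: (J, 2) integer offsets from the center to each body joint
    """
    H, W, _ = feature_map.shape
    joints = center + joint_offsets                       # (J, 2) joint coords
    joints = np.clip(joints, [0, 0], [H - 1, W - 1])      # keep inside the map
    per_joint = feature_map[joints[:, 0], joints[:, 1]]   # (J, C) sampled rows
    return per_joint.mean(axis=0)                         # (C,) instance token

def regress_3d_pose(token, W_reg, b_reg, num_joints):
    """Decode a token into (J, 3) joint coordinates by linear regression."""
    return (token @ W_reg + b_reg).reshape(num_joints, 3)
```

In the paper the tokens are further processed by the video transformer before decoding; the sketch only shows the guidance-by-offsets idea and the final coordinate-regression shape.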

Results

Task | Dataset | Metric | Value | Model
3D Human Pose Estimation | 3DPW | PA-MPJPE | 46 | IVT (f=5)
3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 40.2 | IVT (f=5)
3D Human Pose Estimation | Panoptic | Average MPJPE (mm) | 48.4 | IVT (f=5)
Pose Estimation | 3DPW | PA-MPJPE | 46 | IVT (f=5)
Pose Estimation | Human3.6M | Average MPJPE (mm) | 40.2 | IVT (f=5)
Pose Estimation | Panoptic | Average MPJPE (mm) | 48.4 | IVT (f=5)
3D Pose Estimation | 3DPW | PA-MPJPE | 46 | IVT (f=5)
3D Pose Estimation | Human3.6M | Average MPJPE (mm) | 40.2 | IVT (f=5)
3D Pose Estimation | Panoptic | Average MPJPE (mm) | 48.4 | IVT (f=5)
3D Multi-Person Pose Estimation | Panoptic | Average MPJPE (mm) | 48.4 | IVT (f=5)
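The table reports results under MPJPE and PA-MPJPE. As a hedged sketch (independent of the paper's own evaluation code), the two metrics can be computed in NumPy as follows; the SVD-based similarity alignment is the standard orthogonal-Procrustes construction:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth joints, in the inputs' units (typically mm).
    pred, gt: (J, 3) arrays of 3D joint coordinates."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-Aligned MPJPE: MPJPE after the best similarity transform
    (scale, rotation, translation) maps pred onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    U, s, Vt = np.linalg.svd(p.T @ g)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:           # fix a reflection, if any,
        Vt[-1] *= -1                   # so R is a proper rotation
        s[-1] *= -1
        R = (U @ Vt).T
    scale = s.sum() / (p ** 2).sum()   # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```

PA-MPJPE factors out global pose (useful on 3DPW, where camera-frame alignment is noisy), while plain MPJPE, as on Human3.6M and Panoptic, penalizes global rotation and scale errors as well.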

Related Papers

$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)