Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation

Zhongwei Qiu, Qiansheng Yang, Jian Wang, Dongmei Fu

2022-08-06 · 3D Human Pose Estimation · Pose Estimation · 3D Pose Estimation · 2D Pose Estimation · 3D Multi-Person Pose Estimation
Paper · PDF

Abstract

Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos. Recent transformer-based approaches focus on capturing spatiotemporal information from sequences of 2D poses, but they cannot model contextual depth features effectively because the visual depth cues are lost during 2D pose estimation. In this paper, we simplify the paradigm into an end-to-end framework, the Instance-guided Video Transformer (IVT), which learns spatiotemporal contextual depth information from visual features and predicts 3D poses directly from video frames. In particular, we first formulate video frames as a series of instance-guided tokens, where each token is responsible for predicting the 3D pose of one human instance. These tokens carry body-structure information because they are extracted under the guidance of joint offsets from the human center to the corresponding body joints. The tokens are then fed into IVT to learn spatiotemporal contextual depth. In addition, we propose a cross-scale instance-guided attention mechanism to handle the varying scales of multiple persons. Finally, the 3D pose of each person is decoded from its instance-guided token by coordinate regression. Experiments on three widely used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performance.
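As a rough illustration of the token-extraction and coordinate-regression steps described above (this is not the authors' code: the function names, the nearest-pixel sampling, and the mean-pooling over joints are all assumptions for the sketch):

```python
import numpy as np

def instance_guided_token(feature_map, center, joint_offsets):
    """Sample features at joint locations given as offsets from the human
    center, and pool them into one instance-guided token.

    feature_map:   (H, W, C) visual feature map
    center:        (2,) integer (row, col) human-center location
    joint_offsets: (J, 2) integer offsets from the center to each body joint
    """
    H, W, _ = feature_map.shape
    joints = center + joint_offsets                       # (J, 2) joint coords
    joints = np.clip(joints, [0, 0], [H - 1, W - 1])      # keep inside the map
    per_joint = feature_map[joints[:, 0], joints[:, 1]]   # (J, C) sampled rows
    return per_joint.mean(axis=0)                         # (C,) instance token

def regress_3d_pose(token, W_reg, b_reg, num_joints):
    """Decode a token into (J, 3) joint coordinates by linear regression."""
    return (token @ W_reg + b_reg).reshape(num_joints, 3)
```

In the paper the tokens are further processed by the video transformer before decoding; the sketch only shows the guidance-by-offsets idea and the final coordinate-regression shape.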

Results

Task | Dataset | Metric | Value | Model
3D Human Pose Estimation | 3DPW | PA-MPJPE | 46 | IVT (f=5)
3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 40.2 | IVT (f=5)
3D Human Pose Estimation | Panoptic | Average MPJPE (mm) | 48.4 | IVT (f=5)
Pose Estimation | 3DPW | PA-MPJPE | 46 | IVT (f=5)
Pose Estimation | Human3.6M | Average MPJPE (mm) | 40.2 | IVT (f=5)
Pose Estimation | Panoptic | Average MPJPE (mm) | 48.4 | IVT (f=5)
3D Pose Estimation | 3DPW | PA-MPJPE | 46 | IVT (f=5)
3D Pose Estimation | Human3.6M | Average MPJPE (mm) | 40.2 | IVT (f=5)
3D Pose Estimation | Panoptic | Average MPJPE (mm) | 48.4 | IVT (f=5)
3D Multi-Person Pose Estimation | Panoptic | Average MPJPE (mm) | 48.4 | IVT (f=5)
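The table reports results under MPJPE and PA-MPJPE. As a hedged sketch (independent of the paper's own evaluation code), the two metrics can be computed in NumPy as follows; the SVD-based similarity alignment is the standard orthogonal-Procrustes construction:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth joints, in the inputs' units (typically mm).
    pred, gt: (J, 3) arrays of 3D joint coordinates."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-Aligned MPJPE: MPJPE after the best similarity transform
    (scale, rotation, translation) maps pred onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    U, s, Vt = np.linalg.svd(p.T @ g)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:           # fix a reflection, if any,
        Vt[-1] *= -1                   # so R is a proper rotation
        s[-1] *= -1
        R = (U @ Vt).T
    scale = s.sum() / (p ** 2).sum()   # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```

PA-MPJPE factors out global pose (useful on 3DPW, where camera-frame alignment is noisy), while plain MPJPE, as on Human3.6M and Panoptic, penalizes global rotation and scale errors as well.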

Related Papers

$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)