MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, Junsong Yuan

2022-03-02, CVPR 2022
Tasks: 3D Human Pose Estimation, Monocular 3D Human Pose Estimation, Pose Estimation, Classification

Abstract

Recent transformer-based solutions estimate 3D human pose from a 2D keypoint sequence by considering body joints across all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are applied alternately to obtain a better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to all frames of the input video, thereby improving the coherence between the input and output sequences. Extensive experiments are conducted on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva). The results show that our model outperforms the state-of-the-art approach by 10.9% in P-MPJPE and 7.6% in MPJPE. The code is available at https://github.com/JinluZhang1126/MixSTE.
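
The alternating design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain dot-product self-attention with a residual connection and omits the learned projections, multiple heads, MLP layers, positional embeddings, input embedding, and 3D regression head that a real transformer would have. The function names, the spatial-then-temporal order, and the `depth` parameter are hypothetical choices made for illustration. The point is only how tokens are grouped: the spatial block attends over the joints within one frame, while the temporal block attends over the frames of one joint, and the output keeps one entry per input frame (seq2seq).

```python
import math


def self_attention(tokens):
    """Dot-product self-attention over a list of feature vectors.

    Simplifying assumption: no learned query/key/value projections;
    a residual connection adds the attended mix back to each token.
    """
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = [sum(w / total * tok[c] for w, tok in zip(weights, tokens))
                 for c in range(dim)]
        out.append([x + m for x, m in zip(query, mixed)])
    return out


def spatial_block(seq):
    """Attend over the joints of each frame; seq[t][j] is a feature vector."""
    return [self_attention(frame) for frame in seq]


def temporal_block(seq):
    """Attend over the frames of each joint, one joint trajectory at a time."""
    num_frames, num_joints = len(seq), len(seq[0])
    out = [[None] * num_joints for _ in range(num_frames)]
    for j in range(num_joints):
        trajectory = self_attention([seq[t][j] for t in range(num_frames)])
        for t in range(num_frames):
            out[t][j] = trajectory[t]
    return out


def mixed_encoder(seq, depth=2):
    """Alternate spatial and temporal blocks `depth` times (order assumed)."""
    for _ in range(depth):
        seq = spatial_block(seq)
        seq = temporal_block(seq)
    return seq


# Toy input: 3 frames, 4 joints, 2 features per joint (e.g. 2D keypoints).
frames = [[[0.1 * t, 0.2 * j] for j in range(4)] for t in range(3)]
encoded = mixed_encoder(frames, depth=2)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))
```

Because the temporal block processes one joint trajectory at a time, changing the motion of one joint leaves the temporal attention output of every other joint unchanged, which matches the idea of modeling each joint's motion separately.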

Results

Task	Dataset	Metric	Value	Model
3D Human Pose Estimation	HumanEva-I	Mean Reconstruction Error (mm)	16.1	MixSTE (T=43, FT)
3D Human Pose Estimation	MPI-INF-3DHP	AUC	66.5	MixSTE (T=27)
3D Human Pose Estimation	MPI-INF-3DHP	MPJPE (mm)	54.9	MixSTE (T=27)
3D Human Pose Estimation	MPI-INF-3DHP	PCK (%)	94.4	MixSTE (T=27)
3D Human Pose Estimation	MPI-INF-3DHP	AUC	63.8	MixSTE (T=1)
3D Human Pose Estimation	MPI-INF-3DHP	MPJPE (mm)	57.9	MixSTE (T=1)
3D Human Pose Estimation	MPI-INF-3DHP	PCK (%)	94.2	MixSTE (T=1)
3D Human Pose Estimation	Human3.6M	Average MPJPE (mm)	39.8	MixSTE (HRNet, T=243)
3D Human Pose Estimation	Human3.6M	Average MPJPE (mm)	40.9	MixSTE (CPN, T=243)
3D Human Pose Estimation	Human3.6M	Average MPJPE (mm)	42.4	MixSTE (CPN, T=81)
3D Human Pose Estimation	Human3.6M	Frames Needed	243	MixSTE (HRNet, T=243)
Classification	Full-body Parkinson’s disease dataset	F1-score (weighted)	0.41	MixSTE
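
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how MPJPE and PCK are typically computed for a single pose. It is an illustration under stated assumptions, not the evaluation code behind the reported numbers: the function names are hypothetical, poses are plain lists of (x, y, z) coordinates in millimetres, and the 150 mm PCK threshold is a commonly used value for MPI-INF-3DHP that is assumed here, as is the inclusive comparison. P-MPJPE additionally aligns the prediction to the ground truth with a rigid Procrustes transform before measuring the error, and AUC integrates PCK over a range of thresholds; neither is shown.

```python
import math


def joint_errors(pred, target):
    """Per-joint Euclidean distance between predicted and true 3D positions."""
    return [math.dist(p, g) for p, g in zip(pred, target)]


def mpjpe(pred, target):
    """Mean Per-Joint Position Error, in the units of the inputs (mm here)."""
    errors = joint_errors(pred, target)
    return sum(errors) / len(errors)


def pck(pred, target, threshold_mm=150.0):
    """Percentage of joints whose error is within the threshold (assumed 150 mm)."""
    errors = joint_errors(pred, target)
    return 100.0 * sum(e <= threshold_mm for e in errors) / len(errors)


# Toy pose with three joints; the per-joint errors are 30, 40, and 200 mm.
target = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(0.0, 0.0, 30.0), (100.0, 40.0, 0.0), (0.0, 200.0, 200.0)]
print(mpjpe(pred, target))  # 90.0
print(pck(pred, target))    # two of three joints fall within 150 mm
```

Benchmark figures such as those above are averages of these per-pose values over all evaluated frames, so lower MPJPE and higher PCK both indicate better predictions.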
