Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose Estimation by Leveraging Dynamic Multi-Headed Convolutional Attention

Alec Diaz-Arias, Dmitriy Shin

2023-04-04 · 3D Human Pose Estimation · Monocular 3D Human Pose Estimation · Pose Estimation
Paper · PDF · Code

Abstract

Recently, fully transformer-based architectures have replaced the de facto convolutional architecture for the 3D human pose estimation task. In this paper we propose ConvFormer, a novel convolutional transformer that leverages a new dynamic multi-headed convolutional self-attention mechanism for monocular 3D human pose estimation. We designed a spatial and temporal convolutional transformer to comprehensively model human joint relations within individual frames and globally across the motion sequence. Moreover, we introduce a novel notion of a temporal joints profile for our temporal ConvFormer that fuses complete temporal information immediately for a local neighborhood of joint features. We have quantitatively and qualitatively validated our method on three common benchmark datasets: Human3.6M, MPI-INF-3DHP, and HumanEva. Extensive experiments were conducted to identify the optimal hyper-parameter set. These experiments demonstrated a significant parameter reduction relative to prior transformer models while attaining state-of-the-art (SOTA) or near-SOTA results on all three datasets. Additionally, we achieved SOTA for Protocol III on H36M for both GT and CPN detection inputs. Finally, we obtained SOTA on all three metrics for the MPI-INF-3DHP dataset and for all three subjects on HumanEva under Protocol II.
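The page carries only the abstract, so the exact mechanism is not specified here. As an illustration of the general idea the abstract names — self-attention whose query/key/value projections are convolutional rather than linear — below is a minimal single-head sketch over a sequence of per-frame joint features. All shapes, names, and the depthwise kernel form are assumptions for illustration, not the authors' implementation (which is multi-headed and dynamic).

```python
import numpy as np

def temporal_conv(x, w):
    """Depthwise, 'same'-padded temporal convolution.
    x: (T, C) per-frame joint features; w: (k, C) depthwise kernel."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t:t + k] * w).sum(axis=0) for t in range(x.shape[0])])

def conv_self_attention(x, wq, wk, wv):
    """Single-head convolutional self-attention: Q, K, V come from
    temporal convolutions instead of the usual linear projections."""
    q, k, v = (temporal_conv(x, w) for w in (wq, wk, wv))
    scores = q @ k.T / np.sqrt(x.shape[1])        # (T, T) attention logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v                               # (T, C), same shape as x

rng = np.random.default_rng(0)
x = rng.normal(size=(9, 16))                       # 9 frames, 16-dim features
wq, wk, wv = (rng.normal(size=(3, 16)) * 0.1 for _ in range(3))
out = conv_self_attention(x, wq, wk, wv)
print(out.shape)  # (9, 16)
```

Because each projection sees a local temporal window before global attention is applied, every query/key already mixes information from neighboring frames — the property the abstract attributes to fusing temporal information for a local neighborhood of joint features.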

Results

Task                     | Dataset      | Metric                         | Value | Model
3D Human Pose Estimation | HumanEva-I   | Mean Reconstruction Error (mm) | 24.3  | ConvFormer (T=43)
3D Human Pose Estimation | MPI-INF-3DHP | AUC                            | 69.8  | ConvFormer
3D Human Pose Estimation | MPI-INF-3DHP | MPJPE (mm)                     | 53.6  | ConvFormer
3D Human Pose Estimation | MPI-INF-3DHP | PCK                            | 96.4  | ConvFormer
3D Human Pose Estimation | Human3.6M    | Average MPJPE (mm)             | 43.2  | ConvFormer (T=243, CPN)
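The table reports MPJPE and mean reconstruction error in millimetres; both reduce to a mean Euclidean distance between predicted and ground-truth 3D joints (the reconstruction-error / Protocol II variant applies a rigid alignment first, which is omitted here). A minimal sketch of the unaligned metric:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D joints, in the units of the inputs
    (millimetres for Human3.6M). pred, gt: (frames, joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# toy example: prediction offset by 10 mm along x for every joint
gt = np.zeros((2, 17, 3))            # 2 frames, 17 joints (H36M skeleton)
pred = gt.copy()
pred[..., 0] += 10.0
print(mpjpe(pred, gt))  # 10.0
```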

Related Papers

- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
- DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
- From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
- AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
- SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
- SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)
- Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)