Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ViTPose++: Vision Transformer for Generic Body Pose Estimation

Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

2022-12-07 · 2D Human Pose Estimation · Pose Estimation · Keypoint Detection · Animal Pose Estimation
Paper · PDF · Code (official) · Code

Abstract

In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark at both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
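The two architectural ideas in the abstract — a plain transformer encoder with a lightweight decoder, and knowledge factorization via task-agnostic plus task-specific feed-forward networks — can be illustrated roughly as follows. This is a minimal NumPy sketch, not the authors' implementation; all dimensions, function names, and the additive combination of the two FFN outputs are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def ffn(x, w1, w2):
    # Standard two-layer transformer feed-forward block with ReLU.
    return np.maximum(x @ w1, 0.0) @ w2

# Toy dimensions (assumptions, not the paper's actual sizes).
d, h = 8, 16
tokens = rng.standard_normal((4, d))  # 4 patch tokens of width d

# Knowledge factorization: one task-agnostic FFN shared across all pose
# tasks, plus one task-specific FFN per task (human, animal, ...).
w1_shared, w2_shared = rng.standard_normal((d, h)), rng.standard_normal((h, d))
task_ffns = {
    "human":  (rng.standard_normal((d, h)), rng.standard_normal((h, d))),
    "animal": (rng.standard_normal((d, h)), rng.standard_normal((h, d))),
}

def factorized_block(x, task):
    # Combine shared and task-specific features; the exact combination
    # rule here (a sum) is an assumption for this sketch.
    w1_t, w2_t = task_ffns[task]
    return ffn(x, w1_shared, w2_shared) + ffn(x, w1_t, w2_t)

out_human = factorized_block(tokens, "human")
out_animal = factorized_block(tokens, "animal")
print(out_human.shape)  # (4, 8): same token shape, task-conditioned features
```

The point of the factorized design is that the shared FFN can absorb pose knowledge common to all keypoint tasks, while each lightweight task branch specializes, so one backbone serves heterogeneous keypoint categories.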

Results

Task            | Dataset | Metric | Value | Model
Pose Estimation | AP-10K  | AP     | 82.4  | ViTPose+-H
Pose Estimation | AP-10K  | AP     | 80.4  | ViTPose+-L
Pose Estimation | AP-10K  | AP     | 74.5  | ViTPose+-B
Pose Estimation | AP-10K  | AP     | 73.1  | HRNet-w48
Pose Estimation | AP-10K  | AP     | 72.2  | HRNet-w32
Pose Estimation | AP-10K  | AP     | 71.4  | ViTPose+-S (ViT-S)
Pose Estimation | AP-10K  | AP     | 68.1  | SimpleBaseline-ResNet50
Animal Pose Estimation   | AP-10K         | AP     | 82.4 | ViTPose+-H
Animal Pose Estimation   | AP-10K         | AP     | 80.4 | ViTPose+-L
Animal Pose Estimation   | AP-10K         | AP     | 74.5 | ViTPose+-B
Animal Pose Estimation   | AP-10K         | AP     | 73.1 | HRNet-w48
Animal Pose Estimation   | AP-10K         | AP     | 72.2 | HRNet-w32
Animal Pose Estimation   | AP-10K         | AP     | 71.4 | ViTPose+-S (ViT-S)
Animal Pose Estimation   | AP-10K         | AP     | 68.1 | SimpleBaseline-ResNet50
2D Human Pose Estimation | COCO-WholeBody | WB     | 61.2 | ViTPose+-H
2D Human Pose Estimation | COCO-WholeBody | body   | 75.9 | ViTPose+-H
2D Human Pose Estimation | COCO-WholeBody | face   | 63.3 | ViTPose+-H
2D Human Pose Estimation | COCO-WholeBody | foot   | 77.9 | ViTPose+-H
2D Human Pose Estimation | COCO-WholeBody | hand   | 54.7 | ViTPose+-H
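The AP values above follow COCO-style keypoint evaluation, where a predicted pose is matched to a ground-truth pose by Object Keypoint Similarity (OKS) and precision is averaged over OKS thresholds from 0.50 to 0.95. A minimal NumPy sketch of the OKS computation (function name, argument layout, and the small epsilon guard are my own, not from the paper):

```python
import numpy as np

def oks(pred, gt, visibility, area, k):
    """Object Keypoint Similarity between one predicted and one
    ground-truth pose.

    pred, gt:   (K, 2) keypoint coordinates
    visibility: (K,) flags; only labeled keypoints (> 0) are scored
    area:       object scale factor s^2 (e.g. the bounding-box area)
    k:          (K,) per-keypoint falloff constants
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)     # squared pixel distances
    e = d2 / (2.0 * area * k ** 2 + 1e-12)    # scale-normalized error
    vis = visibility > 0
    if not vis.any():
        return 0.0
    # Average the Gaussian similarity over the labeled keypoints.
    return float(np.mean(np.exp(-e[vis])))

# A perfect prediction yields OKS = 1.0.
gt = np.array([[10.0, 10.0], [40.0, 30.0]])
print(oks(gt, gt, np.array([1, 1]), area=100.0, k=np.array([0.1, 0.1])))  # 1.0
```

At a given threshold t, a detection counts as a true positive when its best OKS against an unmatched ground truth is at least t; AP then averages precision over the threshold range, which is why small localization errors on small animals (low `area`) are penalized more heavily than on large ones.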

Related Papers

- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
- DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
- From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
- AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
- SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
- SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)
- Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)