Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ViTPose++: Vision Transformer for Generic Body Pose Estimation

Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

2022-12-07 · 2D Human Pose Estimation · Pose Estimation · Keypoint Detection · Animal Pose Estimation
Paper · PDF · Code (official) · Code

Abstract

In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark at both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
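The two architectural ideas in the abstract — a plain transformer encoder with a lightweight decoder, and knowledge factorization via task-agnostic plus task-specific feed-forward networks — can be illustrated roughly as follows. This is a minimal NumPy sketch, not the authors' implementation; all dimensions, function names, and the additive combination of the two FFN outputs are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def ffn(x, w1, w2):
    # Standard two-layer transformer feed-forward block with ReLU.
    return np.maximum(x @ w1, 0.0) @ w2

# Toy dimensions (assumptions, not the paper's actual sizes).
d, h = 8, 16
tokens = rng.standard_normal((4, d))  # 4 patch tokens of width d

# Knowledge factorization: one task-agnostic FFN shared across all pose
# tasks, plus one task-specific FFN per task (human, animal, ...).
w1_shared, w2_shared = rng.standard_normal((d, h)), rng.standard_normal((h, d))
task_ffns = {
    "human":  (rng.standard_normal((d, h)), rng.standard_normal((h, d))),
    "animal": (rng.standard_normal((d, h)), rng.standard_normal((h, d))),
}

def factorized_block(x, task):
    # Combine shared and task-specific features; the exact combination
    # rule here (a sum) is an assumption for this sketch.
    w1_t, w2_t = task_ffns[task]
    return ffn(x, w1_shared, w2_shared) + ffn(x, w1_t, w2_t)

out_human = factorized_block(tokens, "human")
out_animal = factorized_block(tokens, "animal")
print(out_human.shape)  # (4, 8): same token shape, task-conditioned features
```

The point of the factorized design is that the shared FFN can absorb pose knowledge common to all keypoint tasks, while each lightweight task branch specializes, so one backbone serves heterogeneous keypoint categories.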

Results

Task            | Dataset | Metric | Value | Model
Pose Estimation | AP-10K  | AP     | 82.4  | ViTPose+-H
Pose Estimation | AP-10K  | AP     | 80.4  | ViTPose+-L
Pose Estimation | AP-10K  | AP     | 74.5  | ViTPose+-B
Pose Estimation | AP-10K  | AP     | 73.1  | HRNet-w48
Pose Estimation | AP-10K  | AP     | 72.2  | HRNet-w32
Pose Estimation | AP-10K  | AP     | 71.4  | ViTPose+-S (ViT-S)
Pose Estimation | AP-10K  | AP     | 68.1  | SimpleBaseline-ResNet50
Animal Pose Estimation   | AP-10K         | AP     | 82.4 | ViTPose+-H
Animal Pose Estimation   | AP-10K         | AP     | 80.4 | ViTPose+-L
Animal Pose Estimation   | AP-10K         | AP     | 74.5 | ViTPose+-B
Animal Pose Estimation   | AP-10K         | AP     | 73.1 | HRNet-w48
Animal Pose Estimation   | AP-10K         | AP     | 72.2 | HRNet-w32
Animal Pose Estimation   | AP-10K         | AP     | 71.4 | ViTPose+-S (ViT-S)
Animal Pose Estimation   | AP-10K         | AP     | 68.1 | SimpleBaseline-ResNet50
2D Human Pose Estimation | COCO-WholeBody | WB     | 61.2 | ViTPose+-H
2D Human Pose Estimation | COCO-WholeBody | body   | 75.9 | ViTPose+-H
2D Human Pose Estimation | COCO-WholeBody | face   | 63.3 | ViTPose+-H
2D Human Pose Estimation | COCO-WholeBody | foot   | 77.9 | ViTPose+-H
2D Human Pose Estimation | COCO-WholeBody | hand   | 54.7 | ViTPose+-H
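The AP values above follow COCO-style keypoint evaluation, where a predicted pose is matched to a ground-truth pose by Object Keypoint Similarity (OKS) and precision is averaged over OKS thresholds from 0.50 to 0.95. A minimal NumPy sketch of the OKS computation (function name, argument layout, and the small epsilon guard are my own, not from the paper):

```python
import numpy as np

def oks(pred, gt, visibility, area, k):
    """Object Keypoint Similarity between one predicted and one
    ground-truth pose.

    pred, gt:   (K, 2) keypoint coordinates
    visibility: (K,) flags; only labeled keypoints (> 0) are scored
    area:       object scale factor s^2 (e.g. the bounding-box area)
    k:          (K,) per-keypoint falloff constants
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)     # squared pixel distances
    e = d2 / (2.0 * area * k ** 2 + 1e-12)    # scale-normalized error
    vis = visibility > 0
    if not vis.any():
        return 0.0
    # Average the Gaussian similarity over the labeled keypoints.
    return float(np.mean(np.exp(-e[vis])))

# A perfect prediction yields OKS = 1.0.
gt = np.array([[10.0, 10.0], [40.0, 30.0]])
print(oks(gt, gt, np.array([1, 1]), area=100.0, k=np.array([0.1, 0.1])))  # 1.0
```

At a given threshold t, a detection counts as a true positive when its best OKS against an unmatched ground truth is at least t; AP then averages precision over the threshold range, which is why small localization errors on small animals (low `area`) are penalized more heavily than on large ones.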

Related Papers

- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
- DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
- From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
- AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
- SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
- SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)
- Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)