Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

2022-04-26 | Tasks: 2D Human Pose Estimation, Pose Estimation, Keypoint Detection
Links: Paper, PDF, Code (official)

Abstract

Although no specific domain knowledge is considered in the design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given person instance, and a lightweight decoder for pose estimation. It can be scaled up from 100M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of transformers, setting a new Pareto front between throughput and performance. Moreover, ViTPose is very flexible regarding the attention type, input resolution, pre-training and fine-tuning strategy, as well as dealing with multiple pose tasks. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state-of-the-art. The code and models are available at https://github.com/ViTAE-Transformer/ViTPose.
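The pipeline the abstract describes (cut a cropped person image into non-overlapping patches, run the tokens through a plain, non-hierarchical transformer, then decode per-token features into keypoint heatmaps with a lightweight head) can be illustrated with a toy numpy sketch. All sizes, weights, and the single attention block below are illustrative assumptions for shape-checking only; the real model stacks many trained transformer layers and uses a deconvolution-based decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ToyViTPose:
    """Toy sketch of the ViTPose idea: plain ViT backbone + lightweight
    heatmap decoder. Random weights, one block -- illustration only."""
    def __init__(self, img_hw=(256, 192), patch=16, dim=32, n_kpts=17):
        self.patch, self.dim, self.n_kpts = patch, dim, n_kpts
        self.gh, self.gw = img_hw[0] // patch, img_hw[1] // patch
        in_dim = 3 * patch * patch                       # flattened RGB patch
        self.w_embed = rng.normal(0, 0.02, (in_dim, dim))
        self.w_qkv = rng.normal(0, 0.02, (3, dim, dim))
        self.w_mlp1 = rng.normal(0, 0.02, (dim, 4 * dim))
        self.w_mlp2 = rng.normal(0, 0.02, (4 * dim, dim))
        self.w_head = rng.normal(0, 0.02, (dim, n_kpts))

    def forward(self, img):  # img: (3, H, W) crop of one person instance
        p, gh, gw = self.patch, self.gh, self.gw
        # 1) patch embedding: non-overlapping 16x16 patches -> token sequence
        patches = img.reshape(3, gh, p, gw, p).transpose(1, 3, 0, 2, 4)
        x = patches.reshape(gh * gw, -1) @ self.w_embed   # (N, dim) tokens
        # 2) one plain (non-hierarchical) transformer block: attention + MLP
        q, k, v = (x @ w for w in self.w_qkv)
        x = x + softmax(q @ k.T / np.sqrt(self.dim)) @ v
        x = x + np.maximum(x @ self.w_mlp1, 0) @ self.w_mlp2
        # 3) lightweight decoder: per-token keypoint logits, then 4x upsample
        heat = (x @ self.w_head).reshape(gh, gw, self.n_kpts).transpose(2, 0, 1)
        heat = heat.repeat(4, axis=1).repeat(4, axis=2)   # nearest-neighbour
        return heat                                       # (n_kpts, H/4, W/4)

model = ToyViTPose()
out = model.forward(rng.normal(size=(3, 256, 192)))
print(out.shape)  # (17, 64, 48)
```

With a 256x192 input and 16x16 patches the backbone sees 16x12 = 192 tokens, and the decoded heatmaps come out at a quarter of the input resolution, one channel per COCO keypoint. Keypoint coordinates are then read off as the argmax of each heatmap.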

Results

Task | Dataset | Metric | Value | Model
Pose Estimation | OCHuman | Test AP | 93.3 | ViTPose (ViTAE-G, GT bounding boxes)
Pose Estimation | OCHuman | Validation AP | 92.8 | ViTPose (ViTAE-G, GT bounding boxes)
Pose Estimation | COCO val2017 | AP | 77.3 | ViTPose-B (Single-task_GT-bbox_256x192)
Pose Estimation | COCO val2017 | AP50 | 93.5 | ViTPose-B (Single-task_GT-bbox_256x192)
Pose Estimation | COCO val2017 | AP75 | 84.5 | ViTPose-B (Single-task_GT-bbox_256x192)
Pose Estimation | COCO val2017 | AR | 80.4 | ViTPose-B (Single-task_GT-bbox_256x192)
Pose Estimation | COCO val2017 | AP | 75.8 | ViTPose-B (Single-task_Det-bbox_256x192)
Pose Estimation | COCO val2017 | AP50 | 90.7 | ViTPose-B (Single-task_Det-bbox_256x192)
Pose Estimation | COCO val2017 | AP75 | 83.2 | ViTPose-B (Single-task_Det-bbox_256x192)
Pose Estimation | COCO val2017 | AR | 81.1 | ViTPose-B (Single-task_Det-bbox_256x192)
Pose Estimation | COCO test-dev | AP | 81.1 | ViTPose (ViTAE-G, ensemble)
Pose Estimation | COCO test-dev | AP50 | 95.0 | ViTPose (ViTAE-G, ensemble)
Pose Estimation | COCO test-dev | AP75 | 88.2 | ViTPose (ViTAE-G, ensemble)
Pose Estimation | COCO test-dev | APL | 86.0 | ViTPose (ViTAE-G, ensemble)
Pose Estimation | COCO test-dev | APM | 77.8 | ViTPose (ViTAE-G, ensemble)
Pose Estimation | COCO test-dev | AR | 85.6 | ViTPose (ViTAE-G, ensemble)
Pose Estimation | COCO test-dev | AP | 80.9 | ViTPose (ViTAE-G)
Pose Estimation | COCO test-dev | AP50 | 94.8 | ViTPose (ViTAE-G)
Pose Estimation | COCO test-dev | AP75 | 88.1 | ViTPose (ViTAE-G)
Pose Estimation | COCO test-dev | APL | 85.9 | ViTPose (ViTAE-G)
Pose Estimation | COCO test-dev | APM | 77.5 | ViTPose (ViTAE-G)
Pose Estimation | COCO test-dev | AR | 85.4 | ViTPose (ViTAE-G)
Pose Estimation | CrowdPose | AP | 78.3 | ViTPose-G
Pose Estimation | CrowdPose | AP (Hard) | 67.9 | ViTPose-G
Pose Estimation | CrowdPose | AP50 | 85.3 | ViTPose-G
Pose Estimation | CrowdPose | AP75 | 81.4 | ViTPose-G
Pose Estimation | CrowdPose | APM | 86.6 | ViTPose-G
2D Human Pose Estimation | Human-Art | AP | 0.468 | ViTPose-H
2D Human Pose Estimation | Human-Art | AP (GT bbox) | 0.800 | ViTPose-H
2D Human Pose Estimation | Human-Art | AP | 0.459 | ViTPose-L
2D Human Pose Estimation | Human-Art | AP (GT bbox) | 0.789 | ViTPose-L
2D Human Pose Estimation | Human-Art | AP | 0.410 | ViTPose-B
2D Human Pose Estimation | Human-Art | AP (GT bbox) | 0.759 | ViTPose-B
2D Human Pose Estimation | Human-Art | AP | 0.381 | ViTPose-S
2D Human Pose Estimation | Human-Art | AP (GT bbox) | 0.738 | ViTPose-S
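The AP/AR figures above follow the standard COCO keypoint protocol: each predicted pose is scored against the ground truth by Object Keypoint Similarity (OKS), AP50 and AP75 are average precision at OKS thresholds 0.5 and 0.75, plain AP averages over thresholds 0.50 to 0.95 in steps of 0.05, and APM/APL restrict evaluation to medium/large instances. A minimal sketch of the OKS computation for a single instance, assuming per-keypoint falloff constants are supplied by the caller (COCO hard-codes them per keypoint type):

```python
import numpy as np

def oks(pred, gt, visible, area, kappa):
    """Object Keypoint Similarity between one predicted and one ground-truth
    pose. pred, gt: (K, 2) keypoint coordinates; visible: (K,) mask of
    labelled keypoints; area: ground-truth object area (s^2 in the COCO
    formula); kappa: (K,) per-keypoint falloff constants."""
    d2 = ((pred - gt) ** 2).sum(axis=1)           # squared pixel distance
    sim = np.exp(-d2 / (2 * area * kappa ** 2))   # per-keypoint similarity
    return sim[visible].mean()                    # average over labelled kpts

# toy check: a perfect prediction scores OKS = 1.0
K = 17
gt = np.random.default_rng(1).random((K, 2)) * 100
kappa = np.full(K, 0.05)                          # illustrative constant
score = oks(gt, gt, np.ones(K, bool), area=50 * 80, kappa=kappa)
print(score)  # 1.0
```

Precision and recall are then computed over the whole dataset by matching predictions to ground truths whose OKS exceeds the threshold, exactly as IoU is used for box detection.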

Related Papers

- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
- DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
- From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
- AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
- SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
- SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)
- Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)