Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Sapiens: Foundation for Human Vision Models

Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito

Published: 2024-08-22
Tasks: 2D Human Pose Estimation · Human Part Segmentation · Surface Normal Estimation · Pose Estimation · Keypoint Detection · Depth Estimation · 2D Pose Estimation
Links: Paper · PDF · Code

Abstract

We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability -- model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error. Project page: https://about.meta.com/realitylabs/codecavatars/sapiens.
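The abstract's core recipe is a single encoder pretrained on human images that is adapted to each of the four tasks by fine-tuning only a task-specific decoder. The sketch below is a schematic illustration of that structure, not the authors' code; all class and function names in it are hypothetical.

```python
# Schematic sketch (assumed structure, not the released Sapiens code):
# one pretrained encoder is shared across tasks; each task only attaches
# a lightweight head. Names below are hypothetical.

class PretrainedEncoder:
    """Stands in for a vision backbone pretrained on in-the-wild human images."""
    def __init__(self, params_billions: float):
        # Sapiens models range from 0.3B to 2B parameters.
        self.params_billions = params_billions

    def encode(self, image):
        # Real model: patchify -> transformer -> dense feature map.
        # Here: a trivial placeholder transform.
        return [pixel * 0.5 for pixel in image]


class TaskHead:
    """Lightweight per-task decoder (pose, part-seg, depth, or normals)."""
    def __init__(self, task: str):
        self.task = task

    def predict(self, features):
        # Placeholder for a task-specific decoding step.
        return {"task": self.task, "output": sum(features)}


def build_models(tasks, encoder):
    # Every task reuses the SAME pretrained encoder; only the heads differ,
    # which is what makes per-task adaptation cheap.
    return {task: (encoder, TaskHead(task)) for task in tasks}


encoder = PretrainedEncoder(params_billions=0.3)
models = build_models(["pose", "part-seg", "depth", "normal"], encoder)
assert all(enc is encoder for enc, _ in models.values())
```

The design point this mirrors is that pretraining cost is amortized once across all four tasks, and scaling the shared encoder (0.3B to 2B parameters) improves every task at once.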

Results

Task | Dataset | Metric | Value | Model
Pose Estimation | COCO (Common Objects in Context) | Validation AP | 82.2 | Sapiens-2B
Pose Estimation | COCO (Common Objects in Context) | Validation AP | 82.1 | Sapiens-1B
Pose Estimation | COCO (Common Objects in Context) | Validation AP | 81.2 | Sapiens-0.6B
Pose Estimation | COCO (Common Objects in Context) | Validation AP | 79.6 | Sapiens-0.3B
2D Human Pose Estimation | COCO-WholeBody | WB | 62 | Sapiens-0.3B
2D Human Pose Estimation | COCO-WholeBody | body | 66.4 | Sapiens-0.3B
2D Human Pose Estimation | COCO-WholeBody | face | 87.1 | Sapiens-0.3B
2D Human Pose Estimation | COCO-WholeBody | foot | 67.3 | Sapiens-0.3B
2D Human Pose Estimation | COCO-WholeBody | hand | 58.1 | Sapiens-0.3B
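The abstract quotes some gains in relative terms (e.g. "22.4% relative RMSE" on Hi4D depth, "53.5% relative angular error" on THuman2 normals). For a lower-is-better error metric, such a figure is the percentage reduction against the prior state of the art. A minimal sketch, using hypothetical numbers chosen only to illustrate the arithmetic:

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Percentage reduction of a lower-is-better error metric
    (e.g. RMSE or mean angular error) relative to a baseline."""
    return (baseline - ours) / baseline * 100.0


# Hypothetical numbers for illustration only: a baseline depth RMSE of
# 0.500 reduced to 0.388 is a 22.4% relative improvement, matching the
# form of the gains quoted in the abstract.
print(round(relative_improvement(0.500, 0.388), 1))  # 22.4
```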

Related Papers

$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)