Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito
We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability -- model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error. Project page: https://about.meta.com/realitylabs/codecavatars/sapiens.
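The depth and normal gains above are quoted as relative figures (22.4% and 53.5%). A minimal sketch of how such a figure is typically computed follows, assuming "relative" means the percentage reduction in error with respect to the prior state-of-the-art; the abstract does not spell out the formula. The function name `relative_reduction` and the numbers in the usage line are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the percentage by which new_error is lower than baseline_error.

    Assumption: lower is better (as with RMSE or angular error), and the
    reduction is normalized by the baseline value.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical illustration only: these error values are NOT from the paper.
example = relative_reduction(baseline_error=2.0, new_error=1.5)
print(f"{example:.1f}% relative reduction")
```

Under this reading, a relative reduction depends on the baseline's absolute error, so it is not directly comparable to the absolute gains quoted for pose (mAP) and part segmentation (mIoU).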
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Pose Estimation | COCO (Common Objects in Context) | Validation AP | 82.2 | Sapiens-2B |
| Pose Estimation | COCO (Common Objects in Context) | Validation AP | 82.1 | Sapiens-1B |
| Pose Estimation | COCO (Common Objects in Context) | Validation AP | 81.2 | Sapiens-0.6B |
| Pose Estimation | COCO (Common Objects in Context) | Validation AP | 79.6 | Sapiens-0.3B |
| 2D Human Pose Estimation | COCO-WholeBody | WB (whole-body) | 62 | Sapiens-0.3B |
| 2D Human Pose Estimation | COCO-WholeBody | body | 66.4 | Sapiens-0.3B |
| 2D Human Pose Estimation | COCO-WholeBody | face | 87.1 | Sapiens-0.3B |
| 2D Human Pose Estimation | COCO-WholeBody | foot | 67.3 | Sapiens-0.3B |
| 2D Human Pose Estimation | COCO-WholeBody | hand | 58.1 | Sapiens-0.3B |