Shaokai Ye, Anastasiia Filippova, Jessy Lauer, Steffen Schneider, Maxime Vidal, Tian Qiu, Alexander Mathis, Mackenzie Weygandt Mathis
Quantification of behavior is critical in applications ranging from neuroscience and veterinary medicine to animal conservation. A key first step in behavioral analysis is extracting relevant keypoints on animals, known as pose estimation. However, reliable inference of poses currently requires domain knowledge and manual labeling effort to build supervised models. We present a series of technical innovations, collectively called SuperAnimal, that enable the development of unified foundation models usable on over 45 species without additional human labels. Concretely, we introduce a method to unify the keypoint space across differently labeled datasets (via our generalized data converter) and to train on these diverse datasets without catastrophically forgetting keypoints given the unbalanced inputs (via our keypoint gradient masking and memory replay approaches). These models show excellent performance across six pose benchmarks. To maximize usability for end users, we then demonstrate how to fine-tune the models on differently labeled data and provide tooling for unsupervised video adaptation, which boosts performance and reduces jitter across frames. When fine-tuned, SuperAnimal models are 10-100$\times$ more data efficient than prior transfer-learning-based approaches. We illustrate the utility of our models for behavioral classification in mice and gait analysis in horses. Collectively, this presents a data-efficient solution for animal pose estimation.
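The core idea behind keypoint gradient masking is that, when a single model is trained on datasets annotating different subsets of a unified keypoint vocabulary, the loss (and therefore the gradient) on heatmap channels for keypoints absent from a given sample's source dataset is zeroed out. Below is a minimal PyTorch sketch of that idea; the function, variable names, and loss form are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def masked_heatmap_loss(pred_heatmaps, target_heatmaps, keypoint_mask):
    """Heatmap MSE loss with per-keypoint gradient masking.

    pred_heatmaps:   (batch, num_keypoints, H, W) model output
    target_heatmaps: (batch, num_keypoints, H, W) ground-truth heatmaps
    keypoint_mask:   (batch, num_keypoints), 1.0 where the sample's source
                     dataset annotates that keypoint, 0.0 otherwise
    """
    # Per-channel squared error, averaged over spatial dimensions.
    per_channel = ((pred_heatmaps - target_heatmaps) ** 2).mean(dim=(2, 3))
    # Zero out channels whose keypoints are unlabeled for this sample, so no
    # gradient flows back through them (avoiding spurious zero targets that
    # would erase previously learned keypoints).
    masked = per_channel * keypoint_mask
    # Normalize by the number of annotated keypoints so the loss scale stays
    # comparable across datasets with different keypoint counts.
    return masked.sum() / keypoint_mask.sum().clamp(min=1.0)
```

In the unified keypoint space produced by the data converter, each source dataset contributes a fixed mask over the superset of keypoints, so batches mixing datasets can be trained jointly without one dataset's missing labels degrading another's keypoints.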
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Pose Estimation | AP-10K | AP | 80.113 | SuperAnimal HRNet-w32 |
| Pose Estimation | AP-10K | AP | 68.038 | zero-shot SuperAnimal HRNet-w32 |
| Pose Estimation | Animal-Pose Dataset | AP | 86 | SuperAnimal-AnimalTokenPose |
| Pose Estimation | TriMouse-161 | mAP | 98.547 | SuperAnimal HRNet-w32 |
| Pose Estimation | TriMouse-161 | mAP | 76.139 | zero-shot SuperAnimal HRNet-w32 |
| Pose Estimation | Horse-10 | Normalized Error (OOD) | 0.1091 | SuperAnimal-Quadruped HRNet-w32 |
| Pose Estimation | Horse-10 | Normalized Error (OOD) | 0.179 | mmpose HRNet-w32 (with ImageNet-pretrained weights) |
| 2D Pose Estimation | iRodent | Average mAP | 72.971 | fine-tuned HRNet-w32 pretrained on SuperAnimal (100% of training data) |
| 2D Pose Estimation | iRodent | Average mAP | 61.635 | fine-tuned HRNet-w32 pretrained on AP-10K (100% of training data) |
| 2D Pose Estimation | iRodent | Average mAP | 60.853 | fine-tuned HRNet-w32 pretrained on SuperAnimal (1% of training data) |
| 2D Pose Estimation | iRodent | Average mAP | 58.857 | fine-tuned HRNet-w32 pretrained on ImageNet |
| 2D Pose Estimation | iRodent | Average mAP | 58.557 | zero-shot HRNet-w32 pretrained on SuperAnimal-Quadruped |
| 2D Pose Estimation | iRodent | Average mAP | 55.415 | zero-shot AnimalTokenPose pretrained on AP-10K |
| 2D Pose Estimation | iRodent | Average mAP | 43.144 | fine-tuned HRNet-w32 pretrained on AP-10K (1% of training data) |
| 2D Pose Estimation | iRodent | Average mAP | 40.389 | zero-shot HRNet-w32 pretrained on AP-10K |
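The zero-shot rows above correspond to running the released SuperAnimal checkpoints directly on new data. As a usage illustration, the models are distributed through the DeepLabCut toolbox and can be applied to a video roughly as in the sketch below; the exact keyword arguments vary between DeepLabCut releases, so treat this as an assumption-laden example rather than the canonical API, and the video path is hypothetical.

```python
import deeplabcut

# Hypothetical recording; replace with your own video file.
videos = ["top_view_mouse_session.mp4"]

# Zero-shot SuperAnimal inference; "superanimal_topviewmouse" and
# "superanimal_quadruped" are the released model families, but the
# precise signature may differ across DeepLabCut versions.
deeplabcut.video_inference_superanimal(
    videos,
    superanimal_name="superanimal_topviewmouse",
)
```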