Mu Zhou, Lucas Stoffl, Mackenzie Weygandt Mathis, Alexander Mathis
Frequent interactions between individuals are a fundamental challenge for pose estimation algorithms. Current pipelines either use an object detector together with a pose estimator (top-down approach), or localize all body parts first and then link them to predict the pose of individuals (bottom-up approach). Yet, when individuals closely interact, top-down methods are ill-defined due to overlapping individuals, and bottom-up methods often falsely infer connections to distant body parts. Thus, we propose a novel pipeline called bottom-up conditioned top-down pose estimation (BUCTD) that combines the strengths of bottom-up and top-down methods. Specifically, we propose to use a bottom-up model as the detector, which in addition to an estimated bounding box provides a pose proposal that is fed as a condition to an attention-based top-down model. We demonstrate the performance and efficiency of our approach on animal and human pose estimation benchmarks. On CrowdPose and OCHuman, we outperform previous state-of-the-art models by a significant margin. We achieve 78.5 AP on CrowdPose and 48.5 AP on OCHuman, an improvement of 8.6% and 7.8% over the prior art, respectively. Furthermore, we show that our method strongly improves the performance on multi-animal benchmarks involving fish and monkeys. The code is available at https://github.com/amathislab/BUCTD
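
The detector step described in the abstract, where a bottom-up model supplies both a bounding box and a pose proposal for each individual, can be illustrated with a minimal sketch. This is not the BUCTD implementation: the function names `bbox_from_pose` and `conditioned_inputs`, the 10% box margin, and the use of NaN for missing keypoints are all assumptions made for illustration, and the attention-based top-down model that would consume these inputs is omitted.

```python
import numpy as np


def bbox_from_pose(keypoints, margin=0.1, image_size=None):
    """Derive a padded box (x0, y0, x1, y1) from one pose proposal.

    keypoints: (K, 2) float array of (x, y); rows containing NaN are
    treated as missing keypoints (an assumption of this sketch).
    """
    valid = keypoints[~np.isnan(keypoints).any(axis=1)]
    if len(valid) == 0:
        return None  # nothing to build a box from
    x0, y0 = valid.min(axis=0)
    x1, y1 = valid.max(axis=0)
    # Pad the tight box by a fraction of its width and height.
    pad_x, pad_y = margin * (x1 - x0), margin * (y1 - y0)
    x0, y0, x1, y1 = x0 - pad_x, y0 - pad_y, x1 + pad_x, y1 + pad_y
    if image_size is not None:
        width, height = image_size
        x0, x1 = np.clip([x0, x1], 0, width - 1)
        y0, y1 = np.clip([y0, y1], 0, height - 1)
    return float(x0), float(y0), float(x1), float(y1)


def conditioned_inputs(image, poses, margin=0.1):
    """Pair each image crop with its pose proposal in crop coordinates.

    A hypothetical top-down model would take (crop, condition pose) as input.
    """
    height, width = image.shape[:2]
    samples = []
    for pose in poses:
        box = bbox_from_pose(pose, margin, image_size=(width, height))
        if box is None:
            continue
        x0, y0, x1, y1 = box
        xi0, yi0 = int(np.floor(x0)), int(np.floor(y0))
        xi1, yi1 = int(np.ceil(x1)) + 1, int(np.ceil(y1)) + 1
        crop = image[yi0:yi1, xi0:xi1]
        # Shift the proposal so it lines up with the cropped pixels.
        local_pose = pose - np.array([xi0, yi0], dtype=float)
        samples.append((crop, local_pose))
    return samples
```

Because every crop carries its own pose proposal, two overlapping individuals can yield nearly identical crops that still differ in the condition, which is how the abstract's ambiguity of plain top-down methods is meant to be resolved.
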
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Pose Estimation | OCHuman | Test AP | 47.2 | BUCTD (CID-W32) |
| Pose Estimation | OCHuman | Validation AP | 47.7 | BUCTD (CID-W32) |
| Pose Estimation | COCO (Common Objects in Context) | AP | 77.8 | BUCTD (PETR, with generative sampling) |
| Pose Estimation | COCO (Common Objects in Context) | APL | 83.7 | BUCTD (PETR, with generative sampling) |
| Pose Estimation | COCO (Common Objects in Context) | APM | 74.2 | BUCTD (PETR, with generative sampling) |
| Pose Estimation | CrowdPose | AP | 78.5 | BUCTD-W48 (w/cond. input from PETR, and generative sampling) |
| Pose Estimation | CrowdPose | AP Easy | 83.9 | BUCTD-W48 (w/cond. input from PETR, and generative sampling) |
| Pose Estimation | CrowdPose | AP Hard | 72.3 | BUCTD-W48 (w/cond. input from PETR, and generative sampling) |
| Pose Estimation | CrowdPose | AP Medium | 79.0 | BUCTD-W48 (w/cond. input from PETR, and generative sampling) |
| Pose Estimation | CrowdPose | AP | 76.7 | BUCTD-W48 (w/cond. input from PETR) |
| Pose Estimation | CrowdPose | AP | 72.9 | BUCTD-W48 |
| Pose Estimation | CrowdPose | mAP @0.5:0.95 | 78.5 | BUCTD-W48 (w/cond. input from PETR, and generative sampling) |
| Pose Estimation | Fish-100 | mAP | 89.1 | HRNet-W48 + Faster R-CNN |
| Pose Estimation | Fish-100 | mAP | 88.7 | BUCTD-preNet-W48 (DLCRNet) |
| Pose Estimation | Fish-100 | mAP | 88.0 | BUCTD-preNet-W48 (CID-W32) |
| Pose Estimation | Marmoset-8K | mAP | 93.3 | BUCTD-preNet-W48 (CID-W32) |
| Pose Estimation | Marmoset-8K | mAP | 92.5 | CID-W32 |
| Pose Estimation | Marmoset-8K | mAP | 91.6 | BUCTD-CoAM-W48 (DLCRNet) |
| Pose Estimation | TriMouse-161 | mAP | 99.1 | BUCTD-CoAM-W48 (DLCRNet) |
| Pose Estimation | TriMouse-161 | mAP | 95.8 | DLCRNet |
| Pose Estimation | TriMouse-161 | mAP | 86.8 | CID-W32 |