Rethinking pose estimation in crowds: overcoming the detection information-bottleneck and ambiguity

Mu Zhou, Lucas Stoffl, Mackenzie Weygandt Mathis, Alexander Mathis

2023-06-13
Pose Estimation, Multi-Person Pose Estimation, Animal Pose Estimation

Abstract

Frequent interactions between individuals are a fundamental challenge for pose estimation algorithms. Current pipelines either use an object detector together with a pose estimator (top-down approach), or localize all body parts first and then link them to predict the pose of individuals (bottom-up approach). Yet, when individuals closely interact, top-down methods are ill-defined due to overlapping individuals, and bottom-up methods often falsely infer connections to distant body parts. Thus, we propose a novel pipeline called bottom-up conditioned top-down pose estimation (BUCTD) that combines the strengths of bottom-up and top-down methods. Specifically, we propose to use a bottom-up model as the detector, which in addition to an estimated bounding box provides a pose proposal that is fed as a condition to an attention-based top-down model. We demonstrate the performance and efficiency of our approach on animal and human pose estimation benchmarks. On CrowdPose and OCHuman, we outperform previous state-of-the-art models by a significant margin. We achieve 78.5 AP on CrowdPose and 48.5 AP on OCHuman, an improvement of 8.6% and 7.8% over the prior art, respectively. Furthermore, we show that our method strongly improves the performance on multi-animal benchmarks involving fish and monkeys. The code is available at https://github.com/amathislab/BUCTD
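
To make the data flow described in the abstract concrete, here is a minimal sketch of how bottom-up pose proposals could be turned into inputs for a conditioned top-down model. It is an illustration only, not the BUCTD implementation: the function names (`bbox_from_pose`, `pose_in_crop`, `make_top_down_inputs`), the `margin` and `min_conf` parameters, and the choice to pass the condition as crop-normalized coordinates are all assumptions made for this example. The abstract does not specify how the condition is encoded.

```python
from typing import Dict, List, Optional, Tuple

Keypoint = Tuple[float, float, float]      # (x, y, confidence) in image pixels
Box = Tuple[float, float, float, float]    # (x0, y0, x1, y1) in image pixels


def bbox_from_pose(pose: List[Keypoint], image_size: Tuple[int, int],
                   margin: float = 0.25, min_conf: float = 0.1) -> Optional[Box]:
    """Derive a padded bounding box from one bottom-up pose proposal.

    Hypothetical helper: `margin` and `min_conf` are illustrative defaults.
    """
    width, height = image_size
    points = [(x, y) for x, y, conf in pose if conf >= min_conf]
    if not points:
        return None  # no confident keypoint, so no detection for this proposal
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    pad_x = margin * (x1 - x0)
    pad_y = margin * (y1 - y0)
    # Pad the tight box and clip it to the image borders.
    return (max(0.0, x0 - pad_x), max(0.0, y0 - pad_y),
            min(float(width), x1 + pad_x), min(float(height), y1 + pad_y))


def pose_in_crop(pose: List[Keypoint], box: Box) -> List[Keypoint]:
    """Express the pose proposal in coordinates normalized to the crop."""
    x0, y0, x1, y1 = box
    w = max(x1 - x0, 1e-6)
    h = max(y1 - y0, 1e-6)
    return [((x - x0) / w, (y - y0) / h, conf) for x, y, conf in pose]


def make_top_down_inputs(poses: List[List[Keypoint]],
                         image_size: Tuple[int, int]) -> List[Dict[str, object]]:
    """Pair each detection box with its pose proposal (the condition)."""
    inputs = []
    for pose in poses:
        box = bbox_from_pose(pose, image_size)
        if box is None:
            continue
        inputs.append({"box": box, "condition": pose_in_crop(pose, box)})
    return inputs


if __name__ == "__main__":
    proposals = [[(10.0, 20.0, 0.9), (30.0, 60.0, 0.8), (0.0, 0.0, 0.0)]]
    for item in make_top_down_inputs(proposals, image_size=(100, 100)):
        print(item["box"], item["condition"])
```

The point of the sketch is that each crop carries its own pose proposal, so the top-down stage has information about which individual in an overlapping crop it should predict. A standard object detector would supply only the box.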

Results

Task	Dataset	Metric	Value	Model
Pose Estimation	OCHuman	Test AP	47.2	BUCTD (CID-W32)
Pose Estimation	OCHuman	Validation AP	47.7	BUCTD (CID-W32)
Pose Estimation	COCO (Common Objects in Context)	AP	77.8	BUCTD (PETR, with generative sampling)
Pose Estimation	COCO (Common Objects in Context)	APL	83.7	BUCTD (PETR, with generative sampling)
Pose Estimation	COCO (Common Objects in Context)	APM	74.2	BUCTD (PETR, with generative sampling)
Pose Estimation	CrowdPose	AP	78.5	BUCTD-W48 (w/cond. input from PETR, and generative sampling)
Pose Estimation	CrowdPose	AP Easy	83.9	BUCTD-W48 (w/cond. input from PETR, and generative sampling)
Pose Estimation	CrowdPose	AP Hard	72.3	BUCTD-W48 (w/cond. input from PETR, and generative sampling)
Pose Estimation	CrowdPose	AP Medium	79	BUCTD-W48 (w/cond. input from PETR, and generative sampling)
Pose Estimation	CrowdPose	AP	76.7	BUCTD-W48 (w/cond. input from PETR)
Pose Estimation	CrowdPose	AP	72.9	BUCTD-W48
Pose Estimation	CrowdPose	mAP @0.5:0.95	78.5	BUCTD-W48 (w/cond. input from PETR, and generative sampling)
Pose Estimation	Fish-100	mAP	89.1	HRNet-W48 + Faster R-CNN
Pose Estimation	Fish-100	mAP	88.7	BUCTD-preNet-W48 (DLCRNet)
Pose Estimation	Fish-100	mAP	88	BUCTD-preNet-W48 (CID-W32)
Pose Estimation	Marmoset-8K	mAP	93.3	BUCTD-preNet-W48 (CID-W32)
Pose Estimation	Marmoset-8K	mAP	92.5	CID-W32
Pose Estimation	Marmoset-8K	mAP	91.6	BUCTD-CoAM-W48 (DLCRNet)
Pose Estimation	TriMouse-161	mAP	99.1	BUCTD-CoAM-W48 (DLCRNet)
Pose Estimation	TriMouse-161	mAP	95.8	DLCRNet
Pose Estimation	TriMouse-161	mAP	86.8	CID-W32
Animal Pose Estimation	Fish-100	mAP	89.1	HRNet-W48 + Faster R-CNN
Animal Pose Estimation	Fish-100	mAP	88.7	BUCTD-preNet-W48 (DLCRNet)
Animal Pose Estimation	Fish-100	mAP	88	BUCTD-preNet-W48 (CID-W32)
Animal Pose Estimation	Marmoset-8K	mAP	93.3	BUCTD-preNet-W48 (CID-W32)
Animal Pose Estimation	Marmoset-8K	mAP	92.5	CID-W32
Animal Pose Estimation	Marmoset-8K	mAP	91.6	BUCTD-CoAM-W48 (DLCRNet)
Animal Pose Estimation	TriMouse-161	mAP	99.1	BUCTD-CoAM-W48 (DLCRNet)
Animal Pose Estimation	TriMouse-161	mAP	95.8	DLCRNet
Animal Pose Estimation	TriMouse-161	mAP	86.8	CID-W32
Multi-Person Pose Estimation	CrowdPose	AP Easy	83.9	BUCTD-W48 (w/cond. input from PETR, and generative sampling)
Multi-Person Pose Estimation	CrowdPose	AP Hard	72.3	BUCTD-W48 (w/cond. input from PETR, and generative sampling)
Multi-Person Pose Estimation	CrowdPose	AP Medium	79	BUCTD-W48 (w/cond. input from PETR, and generative sampling)
Multi-Person Pose Estimation	CrowdPose	mAP @0.5:0.95	78.5	BUCTD-W48 (w/cond. input from PETR, and generative sampling)
