Dense Distinct Query for End-to-End Object Detection

Shilong Zhang, Xinjiang Wang, Jiaqi Wang, Jiangmiao Pang, Chengqi Lyu, Wenwei Zhang, Ping Luo, Kai Chen

2023-03-22 | CVPR 2023 | Object Detection

Abstract

One-to-one label assignment in object detection has obviated the need for non-maximum suppression (NMS) as postprocessing and made the pipeline end-to-end. However, it introduces a new dilemma: the widely used sparse queries cannot guarantee high recall, while dense queries inevitably produce more similar queries and run into optimization difficulties. Since both sparse and dense queries are problematic, what queries should end-to-end object detection use? This paper shows that the answer is Dense Distinct Queries (DDQ). Concretely, we first lay dense queries as traditional detectors do and then select distinct ones for one-to-one assignment. DDQ blends the advantages of traditional and recent end-to-end detectors and significantly improves the performance of various detectors, including FCN, R-CNN, and DETRs. Most impressively, DDQ-DETR achieves 52.1 AP on the MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting. DDQ also shares the benefit of end-to-end detectors in crowded scenes and achieves 93.8 AP on CrowdHuman. We hope DDQ can inspire researchers to consider the complementarity between traditional methods and end-to-end detectors. The source code is available at https://github.com/jshilong/DDQ.
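
The "lay dense queries, then select distinct ones" step can be illustrated with a small sketch. The abstract does not say how distinct queries are chosen, so the code below assumes a greedy, class-agnostic NMS-style filter over the boxes predicted by the dense queries. The function names (`iou`, `select_distinct_queries`) and the parameters `iou_threshold` and `max_queries` are hypothetical and are not taken from the DDQ repository.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_distinct_queries(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.7,  # assumed value, for illustration only
    max_queries: int = 300,      # assumed cap, for illustration only
) -> List[int]:
    """Return indices of dense queries that are kept as 'distinct'.

    Assumption: a query is distinct if its box overlaps every
    higher-scoring kept box by at most `iou_threshold`.
    """
    # Visit queries from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
            if len(kept) == max_queries:
                break
    return kept


if __name__ == "__main__":
    dense_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    dense_scores = [0.9, 0.8, 0.7]
    print(select_distinct_queries(dense_boxes, dense_scores, iou_threshold=0.5))
```

In this sketch, only the surviving indices would take part in one-to-one assignment; the near-duplicate second box (IoU of about 0.68 with the first) is dropped at a threshold of 0.5 but kept at 0.7.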

Results

Task	Dataset	Metric	Value	Model
Object Detection	CrowdHuman (full body)	AP	93.8	DDQ DETR (R50)
Object Detection	CrowdHuman (full body)	Recall	98.7	DDQ DETR (R50)
Object Detection	CrowdHuman (full body)	mMR	39.7	DDQ DETR (R50)
Object Detection	CrowdHuman (full body)	AP	93.5	DDQ R-CNN (R50)
Object Detection	CrowdHuman (full body)	Recall	98.6	DDQ R-CNN (R50)
Object Detection	CrowdHuman (full body)	mMR	40.4	DDQ R-CNN (R50)
Object Detection	CrowdHuman (full body)	AP	92.7	DDQ FCN (R50 One-Stage)
Object Detection	CrowdHuman (full body)	Recall	98.2	DDQ FCN (R50 One-Stage)
Object Detection	CrowdHuman (full body)	mMR	41	DDQ FCN (R50 One-Stage)
