OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

Junjie Wang, Bin Chen, Bin Kang, Yulin Li, YiChi Chen, Weizhi Xian, Huifeng Chang, Yong Xu

2024-05-28

Tasks: Denoising, Zero-Shot Object Detection, Contrastive Learning, Open Vocabulary Object Detection, Object Detection


Abstract

Open-vocabulary detection aims to detect objects from novel categories beyond the base categories on which the detector is trained. However, existing open-vocabulary detectors trained on base-category data tend to assign higher confidence to trained categories and confuse novel categories with the background. To resolve this, we propose OV-DQUO, an Open-Vocabulary DETR with Denoising text Query training and open-world Unknown Objects supervision. Specifically, we introduce a wildcard matching method that enables the detector to learn from pairs of unknown objects recognized by an open-world detector and text embeddings with general semantics, mitigating the confidence bias between base and novel categories. Additionally, we propose a denoising text query training strategy that synthesizes foreground and background query-box pairs from open-world unknown objects to train the detector through contrastive learning, enhancing its ability to distinguish novel objects from the background. We conducted extensive experiments on the challenging OV-COCO and OV-LVIS benchmarks, achieving new state-of-the-art results of 45.6 AP50 and 39.3 mAP on novel categories, respectively, without the need for additional training data. Models and code are released at https://github.com/xiaomoguhz/OV-DQUO
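The abstract describes the denoising text query strategy only at a high level. Below is a minimal, hypothetical sketch of the general idea, modeled on DINO-style contrastive denoising: each unknown-object box is perturbed with small noise to form a foreground pair and with larger noise to form a background pair, and every box is paired with one general-semantics ("wildcard") text embedding. This is not the OV-DQUO implementation. The box format (normalized cx, cy, w, h), the noise bounds `pos_scale` and `neg_scale`, the one-positive-one-negative-per-box scheme, all names (`DenoisingQuery`, `make_denoising_queries`, `wildcard_embedding`), and the sigmoid binary cross-entropy used as a stand-in for the paper's contrastive loss are assumptions.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical box format: (cx, cy, w, h), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class DenoisingQuery:
    """Hypothetical container for one synthesized query-box pair."""
    text_embedding: List[float]  # general-semantics ("wildcard") text embedding
    box: Box                     # noised copy of an unknown-object box
    label: int                   # 1 = foreground (small noise), 0 = background (large noise)


def _signed(rng: random.Random, lo: float, hi: float) -> float:
    """Offset with magnitude drawn uniformly from [lo, hi] and a random sign."""
    mag = rng.uniform(lo, hi)
    return mag if rng.random() < 0.5 else -mag


def _clamp01(v: float) -> float:
    return min(max(v, 0.0), 1.0)


def jitter_box(box: Box, lo: float, hi: float, rng: random.Random) -> Box:
    """Shift the center by a fraction of half the box size and rescale w/h,
    using relative offsets whose magnitudes lie in [lo, hi]."""
    cx, cy, w, h = box
    ncx = cx + _signed(rng, lo, hi) * w / 2
    ncy = cy + _signed(rng, lo, hi) * h / 2
    nw = max(w * (1 + _signed(rng, lo, hi)), 1e-3)  # keep the box non-degenerate
    nh = max(h * (1 + _signed(rng, lo, hi)), 1e-3)
    return (_clamp01(ncx), _clamp01(ncy), _clamp01(nw), _clamp01(nh))


def make_denoising_queries(
    unknown_boxes: Sequence[Box],
    wildcard_embedding: List[float],
    pos_scale: float = 0.2,  # assumed noise bound for foreground pairs
    neg_scale: float = 0.6,  # assumed upper noise bound for background pairs
    seed: int = 0,
) -> List[DenoisingQuery]:
    """Emit one foreground pair (noise in [0, pos_scale]) and one background
    pair (noise in [pos_scale, neg_scale]) for every unknown-object box."""
    rng = random.Random(seed)
    queries: List[DenoisingQuery] = []
    for box in unknown_boxes:
        queries.append(DenoisingQuery(wildcard_embedding, jitter_box(box, 0.0, pos_scale, rng), 1))
        queries.append(DenoisingQuery(wildcard_embedding, jitter_box(box, pos_scale, neg_scale, rng), 0))
    return queries


def binary_contrastive_loss(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean sigmoid binary cross-entropy (a stand-in for the paper's contrastive
    objective): pushes foreground logits up and background logits down."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = -z if y == 1 else z  # per-sample loss = softplus(s)
        total += max(s, 0.0) + math.log1p(math.exp(-abs(s)))  # numerically stable softplus
    return total / len(labels)


if __name__ == "__main__":
    boxes = [(0.5, 0.5, 0.2, 0.3), (0.25, 0.7, 0.1, 0.1)]
    wildcard = [0.1, -0.3, 0.5]  # placeholder embedding, not a real text encoder output
    qs = make_denoising_queries(boxes, wildcard)
    print(len(qs), [q.label for q in qs])  # 4 [1, 0, 1, 0]
```

Pairing every noised box with the same general-purpose embedding loosely mirrors the abstract's idea of supervising unknown objects with text embeddings of general semantics. In an actual detector, the logits passed to the loss would presumably come from comparing decoder outputs for these queries with the text embedding; here they are left as plain numbers.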

Results

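Task	Dataset	Metric	Value	Model
Open Vocabulary Object Detection	LVIS v1.0 (OV-LVIS)	mAP, novel categories (LVIS base training)	39.3	OV-DQUO (ViT-L/14)
Open Vocabulary Object Detection	LVIS v1.0 (OV-LVIS)	mAP, novel categories (LVIS base training)	29.7	OV-DQUO (ViT-B/16)
Open Vocabulary Object Detection	MSCOCO (OV-COCO)	AP50, novel categories	45.6	OV-DQUO (RN50x4)
Open Vocabulary Object Detection	MSCOCO (OV-COCO)	AP50, novel categories	39.2	OV-DQUO (R50)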
