Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection

Qibo Chen, Weizhong Jin, Jianyue Ge, Mengdi Liu, Yuchao Yan, Jian Jiang, Li Yu, Xuanjiang Guo, Shuchang Li, Jianzhong Chen

2024-12-13 · Zero-Shot Object Detection · Object Detection

Paper · PDF

Abstract

Recent research on universal object detection aims to introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated to fully utilize the prompted information by prompt multi-label loss and auxiliary detection head. In addition to text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt, to extract abstract concepts through concrete visual examples and stably reduce alignment bias in downstream tasks. With these effective designs, CP-DETR demonstrates superior universal detection performance in a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13.
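The abstract's central mechanism is a prompt-visual hybrid encoder that fuses concept-prompt embeddings with multi-scale image features. The paper's actual architecture is not reproduced here; the following is a minimal numpy sketch, under the assumption that the scale-by-scale fusion can be approximated as alternating single-head cross-attention between the prompt embeddings and each feature scale. All function and variable names (`cross_attend`, `prompt_visual_fusion`, etc.) are illustrative, not from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: queries gather information from keys_values."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (Nq, Nk) similarity
    return softmax(scores, axis=-1) @ keys_values   # (Nq, d) attended output

def prompt_visual_fusion(prompt_emb, visual_scales):
    """Toy scale-by-scale fusion: prompts attend to each feature scale in turn,
    and each scale is then refreshed by attending back to the updated prompts."""
    fused_scales = []
    for feats in visual_scales:
        prompt_emb = prompt_emb + cross_attend(prompt_emb, feats)
        fused_scales.append(feats + cross_attend(feats, prompt_emb))
    return prompt_emb, fused_scales

rng = np.random.default_rng(0)
prompts = rng.standard_normal((3, 32))                        # 3 concept prompts
scales = [rng.standard_normal((n, 32)) for n in (64, 16, 4)]  # multi-scale features
p, fs = prompt_visual_fusion(prompts, scales)
print(p.shape, [f.shape for f in fs])  # (3, 32) [(64, 32), (16, 32), (4, 32)]
```

The residual updates (`x + cross_attend(...)`) keep shapes unchanged at every scale, which is why the same prompt embeddings can flow through all scales sequentially; the real model additionally uses multi-scale fusion modules, a prompt multi-label loss, and an auxiliary detection head, none of which are modeled above.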

Results

Task | Dataset | Metric | Value | Model
Object Detection | LVIS v1.0 minival | box AP | 69.2 | CP-DETR-L Swin-L (with chunk)
Object Detection | ODinW Full-Shot 13 Tasks | AP | 73.1 | CP-DETR-L (only optimize prompt)
Object Detection | COCO minival | box AP | 64.1 | CP-DETR-L Swin-L (fine-tuned separately on COCO)
Object Detection | LVIS v1.0 minival | AP | 58.2 | CP-DETR-Pro (without LVIS data)
Object Detection | MSCOCO | AP | 55.4 | CP-DETR-Pro (without COCO data)
Object Detection | LVIS v1.0 val | AP | 51.6 | CP-DETR-Pro (without LVIS data)
Object Detection | ODinW | Average Score | 32.2 | CP-DETR-L Swin-L

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge (2025-07-08)
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations (2025-07-07)