Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection

Qibo Chen, Weizhong Jin, Jianyue Ge, Mengdi Liu, Yuchao Yan, Jian Jiang, Li Yu, Xuanjiang Guo, Shuchang Li, Jianzhong Chen

2024-12-13 · Zero-Shot Object Detection · Object Detection

Paper · PDF

Abstract

Recent research on universal object detection aims to introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated to fully utilize the prompted information by prompt multi-label loss and auxiliary detection head. In addition to text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt, to extract abstract concepts through concrete visual examples and stably reduce alignment bias in downstream tasks. With these effective designs, CP-DETR demonstrates superior universal detection performance in a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13.
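The abstract's central mechanism is a prompt-visual hybrid encoder that fuses concept-prompt embeddings with multi-scale image features. The paper's actual architecture is not reproduced here; the following is a minimal numpy sketch, under the assumption that the scale-by-scale fusion can be approximated as alternating single-head cross-attention between the prompt embeddings and each feature scale. All function and variable names (`cross_attend`, `prompt_visual_fusion`, etc.) are illustrative, not from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: queries gather information from keys_values."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (Nq, Nk) similarity
    return softmax(scores, axis=-1) @ keys_values   # (Nq, d) attended output

def prompt_visual_fusion(prompt_emb, visual_scales):
    """Toy scale-by-scale fusion: prompts attend to each feature scale in turn,
    and each scale is then refreshed by attending back to the updated prompts."""
    fused_scales = []
    for feats in visual_scales:
        prompt_emb = prompt_emb + cross_attend(prompt_emb, feats)
        fused_scales.append(feats + cross_attend(feats, prompt_emb))
    return prompt_emb, fused_scales

rng = np.random.default_rng(0)
prompts = rng.standard_normal((3, 32))                        # 3 concept prompts
scales = [rng.standard_normal((n, 32)) for n in (64, 16, 4)]  # multi-scale features
p, fs = prompt_visual_fusion(prompts, scales)
print(p.shape, [f.shape for f in fs])  # (3, 32) [(64, 32), (16, 32), (4, 32)]
```

The residual updates (`x + cross_attend(...)`) keep shapes unchanged at every scale, which is why the same prompt embeddings can flow through all scales sequentially; the real model additionally uses multi-scale fusion modules, a prompt multi-label loss, and an auxiliary detection head, none of which are modeled above.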

Results

Task | Dataset | Metric | Value | Model
Object Detection | LVIS v1.0 minival | box AP | 69.2 | CP-DETR-L Swin-L (with chunk)
Object Detection | ODinW Full-Shot 13 Tasks | AP | 73.1 | CP-DETR-L (only optimize prompt)
Object Detection | COCO minival | box AP | 64.1 | CP-DETR-L Swin-L (fine-tuned separately on COCO)
Object Detection | LVIS v1.0 minival | AP | 58.2 | CP-DETR-Pro (without LVIS data)
Object Detection | MSCOCO | AP | 55.4 | CP-DETR-Pro (without COCO data)
Object Detection | LVIS v1.0 val | AP | 51.6 | CP-DETR-Pro (without LVIS data)
Object Detection | ODinW | Average Score | 32.2 | CP-DETR-L Swin-L

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge (2025-07-08)
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations (2025-07-07)