Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Multi-modal Queried Object Detection in the Wild

Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xiaoshan Yang, Ke Li, Changsheng Xu

2023-05-30 · NeurIPS 2023
Tasks: Few-Shot Object Detection, Zero-Shot Object Detection, Object Detection
Paper · PDF · Code (official)

Abstract

We introduce MQ-Det, an efficient architecture and pre-training strategy designed to utilize both textual descriptions, with their open-set generalization, and visual exemplars, with their rich description granularity, as category queries, namely Multi-modal Queried object Detection, for real-world detection with both open-vocabulary categories and various granularities. MQ-Det incorporates vision queries into existing, well-established language-queried-only detectors. A plug-and-play gated class-scalable perceiver module on top of the frozen detector is proposed to augment category text with class-wise visual information. To address the learning inertia problem introduced by the frozen detector, a vision-conditioned masked language prediction strategy is proposed. MQ-Det's simple yet effective architecture and training strategy are compatible with most language-queried object detectors, yielding versatile applications. Experimental results demonstrate that multi-modal queries largely boost open-world detection. For instance, MQ-Det improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries without any downstream fine-tuning, and by +6.3% AP on average across 13 few-shot downstream tasks, while requiring only an additional 3% of GLIP's modulating time. Code is available at https://github.com/YifanXu74/MQ-Det.
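The gated class-scalable perceiver described above can be sketched as a per-class cross-attention over visual exemplars, combined with the text query through a zero-initialised gate so the frozen detector's behaviour is preserved at the start of training. The sketch below is a rough NumPy illustration under that reading of the abstract, not the authors' implementation; the function and parameter names (`gated_perceiver`, `Wq`, `Wk`, `Wv`, `gate`) are hypothetical, and the real module sits inside the frozen GLIP / Grounding DINO text branch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_perceiver(text_emb, vis_emb, Wq, Wk, Wv, gate):
    """Augment class text queries with class-wise visual exemplar features.

    text_emb: (K, D) one text query embedding per category.
    vis_emb:  (K, N, D) N visual exemplar embeddings per category.
    gate:     scalar; tanh(0) = 0, so a zero-initialised gate leaves the
              frozen detector's original text queries untouched.
    """
    q = text_emb @ Wq                                   # (K, D)
    k = vis_emb @ Wk                                    # (K, N, D)
    v = vis_emb @ Wv                                    # (K, N, D)
    # Each class query attends only to its own exemplars ("class-scalable"):
    # adding a category adds a row, without cross-class interaction.
    scores = np.einsum('kd,knd->kn', q, k) / np.sqrt(q.shape[-1])
    attended = np.einsum('kn,knd->kd', softmax(scores), v)
    return text_emb + np.tanh(gate) * attended          # gated residual
```

With `gate=0.0` the output equals `text_emb` exactly, which is one way a plug-and-play module can be bolted onto a frozen detector without disturbing it before modulation begins.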

Results

Task | Dataset | Metric | Value | Model
Object Detection | ODinW Full-Shot 13 Tasks | AP | 71.3 | MQ-GLIP-L
Object Detection | ODinW-35 | Average Score | 43 | MQ-GLIP-T
Object Detection | ODinW-13 | Average Score | 57 | MQ-GLIP-T
Object Detection | LVIS v1.0 minival | AP | 43.4 | MQ-GLIP-L
Object Detection | LVIS v1.0 minival | AP | 30.4 | MQ-GLIP-T
Object Detection | LVIS v1.0 minival | AP | 30.2 | MQ-GroundingDINO-T
Object Detection | LVIS v1.0 val | AP | 34.7 | MQ-GLIP-L
Object Detection | LVIS v1.0 val | AP | 22.6 | MQ-GLIP-T
Object Detection | LVIS v1.0 val | AP | 22.1 | MQ-GroundingDINO-T
Object Detection | ODinW | Average Score | 23.9 | MQ-GLIP-L
Few-Shot Object Detection | ODinW-35 | Average Score | 43 | MQ-GLIP-T
Few-Shot Object Detection | ODinW-13 | Average Score | 57 | MQ-GLIP-T

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge (2025-07-08)
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations (2025-07-07)