Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xiaoshan Yang, Ke Li, Changsheng Xu
We introduce MQ-Det (Multi-modal Queried object Detection), an efficient architecture and pre-training strategy that uses both textual descriptions, which offer open-set generalization, and visual exemplars, which offer rich descriptive granularity, as category queries. It targets real-world detection with open-vocabulary categories and varying granularity. MQ-Det incorporates vision queries into existing, well-established language-queried-only detectors. We propose a plug-and-play gated class-scalable perceiver module, placed on top of the frozen detector, that augments category text with class-wise visual information. To address the learning inertia problem caused by the frozen detector, we propose a vision-conditioned masked language prediction strategy. MQ-Det's simple yet effective architecture and training strategy are compatible with most language-queried object detectors, which makes it broadly applicable. Experimental results demonstrate that multi-modal queries substantially improve open-world detection. For instance, MQ-Det improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries without any downstream finetuning, and by +6.3% AP on average on 13 few-shot downstream tasks, with only 3% additional modulating time on top of that required by GLIP. Code is available at https://github.com/YifanXu74/MQ-Det.
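To make the idea of gating class-wise visual information into category text more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`gated_augment`, `augment_all`), the single-head dot-product attention, and the `tanh` gate with a scalar parameter initialized at zero are all assumptions chosen for illustration. Each category's text embedding attends only to that category's own visual exemplars, and the pooled visual feature is added back as a gated residual.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_augment(text_emb, exemplars, gate):
    """Augment one category's text embedding with its own visual exemplars.

    text_emb:  list[float], text query embedding of a single category
    exemplars: list[list[float]], visual exemplar embeddings of that category
    gate:      float, hypothetical learnable scalar; tanh(gate) scales the
               visual residual, so gate = 0 leaves the text query unchanged
    """
    if not exemplars:
        # No vision queries for this category: fall back to text only.
        return list(text_emb)
    dim = len(text_emb)
    scale = 1.0 / math.sqrt(dim)
    # Attention scores between the text query and each exemplar.
    scores = [scale * sum(t * v for t, v in zip(text_emb, ex)) for ex in exemplars]
    weights = softmax(scores)
    # Attention-weighted average of the exemplar embeddings.
    pooled = [sum(w * ex[i] for w, ex in zip(weights, exemplars)) for i in range(dim)]
    g = math.tanh(gate)
    # Gated residual: text embedding plus scaled visual information.
    return [t + g * p for t, p in zip(text_emb, pooled)]


def augment_all(text_embs, exemplars_by_class, gate):
    """Apply the augmentation class by class; any number of classes works."""
    return {
        name: gated_augment(emb, exemplars_by_class.get(name, []), gate)
        for name, emb in text_embs.items()
    }
```

With `gate = 0.0` the output equals the original text embedding, which illustrates why a gated design can be attached to a frozen language-queried detector without disturbing its initial behavior. Categories without exemplars simply keep their text-only query.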
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | ODinW Full-Shot 13 Tasks | AP | 71.3 | MQ-GLIP-L |
| Object Detection | ODinW-35 | Average Score | 43 | MQ-GLIP-T |
| Object Detection | ODinW-13 | Average Score | 57 | MQ-GLIP-T |
| Object Detection | LVIS v1.0 minival | AP | 43.4 | MQ-GLIP-L |
| Object Detection | LVIS v1.0 minival | AP | 30.4 | MQ-GLIP-T |
| Object Detection | LVIS v1.0 minival | AP | 30.2 | MQ-GroundingDINO-T |
| Object Detection | LVIS v1.0 val | AP | 34.7 | MQ-GLIP-L |
| Object Detection | LVIS v1.0 val | AP | 22.6 | MQ-GLIP-T |
| Object Detection | LVIS v1.0 val | AP | 22.1 | MQ-GroundingDINO-T |
| Object Detection | ODinW | Average Score | 23.9 | MQ-GLIP-L |
| Few-Shot Object Detection | ODinW-35 | Average Score | 43 | MQ-GLIP-T |
| Few-Shot Object Detection | ODinW-13 | Average Score | 57 | MQ-GLIP-T |