Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Class-agnostic Object Detection with Multi-modal Transformer

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang

2021-11-22
Tasks: Open World Object Detection, Class-agnostic Object Detection, Object Proposal Generation, Object Detection

Abstract

What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability. Code: https://git.io/J1HPY

Results

Task | Dataset | Metric | Value | Model
Object Detection | PASCAL VOC 10% | AP | 58.78 | DETReg (MDef-DETR)
Object Detection | PASCAL VOC 10% | AP50 | 80.46 | DETReg (MDef-DETR)
Object Detection | PASCAL VOC 10% | AP75 | 65.65 | DETReg (MDef-DETR)
Object Detection | PASCAL VOC 2007 | AP50 | 84.16 | DETReg (MDef-DETR)
Object Detection | COCO 2017 (Electronic, Indoor, Kitchen, Furniture) | MAP | 31.66 | ORE (MDef-DETR)
Object Detection | COCO 2017 (Sports, Food) | A-OSE | 4117 | ORE (MDef-DETR)
Object Detection | COCO 2017 (Sports, Food) | MAP | 36.75 | ORE (MDef-DETR)
Object Detection | COCO 2017 (Sports, Food) | Unknown Recall | 50.89 | ORE (MDef-DETR)
Object Detection | COCO 2017 (Sports, Food) | WI | 0.0179 | ORE (MDef-DETR)
Object Detection | COCO 2017 (Outdoor, Accessories, Appliance, Truck) | A-OSE | 5212 | ORE (MDef-DETR)
Object Detection | COCO 2017 (Outdoor, Accessories, Appliance, Truck) | MAP | 46.19 | ORE (MDef-DETR)
Object Detection | COCO 2017 (Outdoor, Accessories, Appliance, Truck) | Unknown Recall | 49.54 | ORE (MDef-DETR)
Object Detection | COCO 2017 (Outdoor, Accessories, Appliance, Truck) | WI | 0.0251 | ORE (MDef-DETR)
Object Detection | PASCAL VOC 2007 | A-OSE | 7322 | ORE (MDef-DETR)
Object Detection | PASCAL VOC 2007 | MAP | 64.03 | ORE (MDef-DETR)
Object Detection | PASCAL VOC 2007 | Unknown Recall | 50.13 | ORE (MDef-DETR)
Object Detection | PASCAL VOC 2007 | WI | 0.0474 | ORE (MDef-DETR)
Object Detection | PASCAL VOC 2012, 60 proposals per image | Average Recall | 0.9126 | MDef-DETR
Object Detection | COCO (Common Objects in Context) | Average Recall | 0.6503 | MDef-DETR (Off-the-shelf evaluation)
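
The Average Recall rows in the results table score class-agnostic proposals: a ground-truth box counts as found if any of the top-scoring proposals overlaps it sufficiently, regardless of class label. This page does not state the exact evaluation protocol, so the following is only a minimal sketch under assumptions: boxes are (x1, y1, x2, y2) tuples, the IoU threshold is fixed at 0.5, and matching is not one-to-one. The names `iou` and `recall_at_k` are hypothetical helpers, not functions from the paper's code.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    proposals: Sequence[Box],
    scores: Sequence[float],
    gt_boxes: Sequence[Box],
    k: int = 60,          # proposals kept per image (assumed default)
    iou_thr: float = 0.5,  # assumed overlap threshold
) -> float:
    """Fraction of ground-truth boxes covered by any of the top-k proposals.

    Class labels are ignored entirely, which is what makes the measure
    class-agnostic. Returns 0.0 for an image without ground truth
    (a convention chosen for this sketch).
    """
    if not gt_boxes:
        return 0.0
    # Keep the k highest-scoring proposals.
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)[:k]
    top = [proposals[i] for i in order]
    # A ground-truth box is recalled if at least one kept proposal overlaps it enough.
    hits = sum(1 for g in gt_boxes if any(iou(p, g) >= iou_thr for p in top))
    return hits / len(gt_boxes)


if __name__ == "__main__":
    gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
    props = [(0, 0, 10, 10), (100, 100, 110, 110), (20, 20, 30, 31)]
    scores = [0.9, 0.8, 0.1]
    print(recall_at_k(props, scores, gt, k=2))  # 0.5: only the first box is covered
    print(recall_at_k(props, scores, gt, k=3))  # 1.0: the low-scoring box covers the second
```

Benchmark implementations typically average this quantity over a dataset and often over several IoU thresholds, so numbers from this sketch are not directly comparable to the table.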

Related Papers

Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge (2025-07-08)
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations (2025-07-07)