Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Localized Vision-Language Matching for Open-vocabulary Object Detection

Maria A. Bravo, Sudhanshu Mittal, Thomas Brox

2022-05-12
Tasks: Open World Object Detection, Open Vocabulary Attribute Detection, Open Vocabulary Object Detection, Object Detection, Language Modelling

Abstract

In this work, we propose an open-vocabulary object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner and second specializes the model for the object detection task using known class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-vocabulary detection approaches while being data-efficient. Source code is available at https://github.com/lmb-freiburg/locov.
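
As a rough illustration of the image-caption matching idea mentioned in the abstract, the sketch below scores an image against a caption by letting each caption word attend over region embeddings, then applies a contrastive loss so that matching pairs score higher than mismatched ones. This is not the authors' implementation (the official repository is the reference for that). The function names, the attention-weighted pooling, and the one-directional (image to caption) loss are all assumptions chosen for illustration.

```python
import numpy as np

def matching_score(regions, words):
    """Score one image against one caption (hypothetical helper).

    regions: (R, d) array of region embeddings.
    words:   (W, d) array of word embeddings.
    """
    # Cosine similarity between every word and every region.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = w @ r.T  # shape (W, R)
    # Each word attends over regions; weights sum to 1 per word.
    att = np.exp(sim - sim.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)
    # Attention-weighted similarity per word, averaged over the caption.
    return float((att * sim).sum(axis=1).mean())

def contrastive_loss(score_matrix):
    """Cross-entropy over captions for each image (hypothetical helper).

    score_matrix[i, j] is the score of image i against caption j;
    the diagonal holds the true image-caption pairs.
    """
    s = score_matrix - score_matrix.max(axis=1, keepdims=True)
    log_softmax = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

# Toy batch: two images and two captions with orthogonal embeddings.
basis = np.eye(4)
images = [basis[:2], basis[2:]]
captions = [basis[:2], basis[2:]]
scores = np.array([[matching_score(i, c) for c in captions] for i in images])
loss = contrastive_loss(scores)
```

In this toy batch the diagonal of `scores` is larger than the off-diagonal entries, so the loss is lower than it would be if the captions were swapped. In a weakly-supervised setting of this kind, only the image-caption pairing is given; no box-level labels enter the loss.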

Results

Task | Dataset | Metric | Value | Model
Object Detection | MSCOCO | AP 0.5 | 28.6 | LocOv (RN50-C4)
Object Detection | OVAD benchmark | mean average precision | 14.9 | LocOv (ResNet50)
3D | MSCOCO | AP 0.5 | 28.6 | LocOv (RN50-C4)
3D | OVAD benchmark | mean average precision | 14.9 | LocOv (ResNet50)
2D Classification | MSCOCO | AP 0.5 | 28.6 | LocOv (RN50-C4)
2D Classification | OVAD benchmark | mean average precision | 14.9 | LocOv (ResNet50)
2D Object Detection | MSCOCO | AP 0.5 | 28.6 | LocOv (RN50-C4)
2D Object Detection | OVAD benchmark | mean average precision | 14.9 | LocOv (ResNet50)
Open Vocabulary Object Detection | MSCOCO | AP 0.5 | 28.6 | LocOv (RN50-C4)
Open Vocabulary Object Detection | OVAD benchmark | mean average precision | 14.9 | LocOv (ResNet50)
16k | MSCOCO | AP 0.5 | 28.6 | LocOv (RN50-C4)
16k | OVAD benchmark | mean average precision | 14.9 | LocOv (ResNet50)
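
The page does not define the "AP 0.5" metric. Assuming it follows the common COCO-style reading (average precision where a detection counts as correct if its box overlaps a ground-truth box with intersection-over-union of at least 0.5), the overlap test can be sketched as follows. The helper names and the `(x1, y1, x2, y2)` box format are assumptions for illustration, and the sketch omits the score ranking and precision-recall integration that a full AP computation needs.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box if IoU >= threshold."""
    return box_iou(pred_box, gt_box) >= threshold

gt = (0, 0, 10, 10)
close = is_true_positive((2, 0, 12, 10), gt)  # IoU = 80 / 120
far = is_true_positive((5, 0, 15, 10), gt)    # IoU = 50 / 150
```

Under this threshold, the first prediction would count as a match and the second would not.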
