Localized Vision-Language Matching for Open-vocabulary Object Detection

Maria A. Bravo, Sudhanshu Mittal, Thomas Brox

2022-05-12

Open World Object Detection, Open Vocabulary Attribute Detection, Open Vocabulary Object Detection, Object Detection, Language Modelling

Abstract

In this work, we propose an open-vocabulary object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. Training proceeds in two stages: the first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly supervised manner, and the second specializes the model for the object detection task using known-class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-vocabulary detection approaches while being data-efficient. Source code is available at https://github.com/lmb-freiburg/locov.
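
The first training stage is described only as image-caption matching, so the following is a generic sketch of a region-word matching loss, not the authors' implementation. The function names `pair_score` and `matching_loss`, the softmax attention over regions, and the symmetric cross-entropy over the batch are all assumptions made for illustration. The location guidance and the consistency regularization mentioned in the abstract are not modeled here.

```python
import numpy as np

def pair_score(regions, words):
    """Score one image (R x D region features) against one caption (W x D word embeddings)."""
    sim = regions @ words.T                                # (R, W) region-word dot products
    attn = np.exp(sim - sim.max(axis=0, keepdims=True))    # softmax over regions, per word
    attn /= attn.sum(axis=0, keepdims=True)
    # Each word attends to the regions it matches best; average over words.
    return float((attn * sim).sum(axis=0).mean())

def matching_loss(images, captions):
    """Symmetric cross-entropy that pulls matched image-caption pairs together (assumed form)."""
    n = len(images)
    scores = np.array([[pair_score(images[i], captions[j]) for j in range(n)]
                       for i in range(n)])

    def xent_diag(s):
        # Row-wise log-softmax; the matched pair sits on the diagonal.
        s = s - s.max(axis=1, keepdims=True)
        logp = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Image-to-caption and caption-to-image directions.
    return 0.5 * (xent_diag(scores) + xent_diag(scores.T))
```

In this sketch the loss is small when every caption scores highest against its own image and grows when captions are shuffled across the batch, which is the signal a weakly supervised matching stage would rely on.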

Results

Task	Dataset	Metric	Value	Model
Object Detection	MSCOCO	AP 0.5	28.6	LocOv (RN50-C4)
Object Detection	OVAD benchmark	mean average precision	14.9	LocOv (ResNet50)
Open Vocabulary Object Detection	MSCOCO	AP 0.5	28.6	LocOv (RN50-C4)
Open Vocabulary Object Detection	OVAD benchmark	mean average precision	14.9	LocOv (ResNet50)
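
The metric "AP 0.5" in the table is read here as average precision at an intersection-over-union (IoU) threshold of 0.5, following the usual COCO convention; that reading is an assumption, since the listing does not define the metric. Under it, a predicted box counts as a match only if its IoU with a ground-truth box is at least 0.5. A minimal sketch of that check, with the hypothetical helper names `box_iou` and `is_match`:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Intersection is empty when the boxes do not overlap on either axis.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(pred, gt, thresh=0.5):
    """True if the predicted box overlaps the ground-truth box enough to count as a match."""
    return box_iou(pred, gt) >= thresh
```

A full AP computation additionally ranks detections by confidence and integrates the precision-recall curve; that part is omitted here.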
