Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, Vijay Kumar B. G, Anastasis Stathopoulos, Manmohan Chandraker, Dimitris Metaxas

2022-07-18
Region Proposal, Open Vocabulary Object Detection, Object Detection, Semi-Supervised Object Detection


Abstract

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category that is required for downstream tasks. We demonstrate the value of the generated pseudo labels in two tasks: open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks: we outperform competitive baselines and achieve a new state of the art for open-vocabulary object detection. Our code is available at https://github.com/xiaofeng94/VL-PLM.
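The two-stage idea in the abstract (class-agnostic region proposals, then categorization by a vision and language model) can be illustrated with a minimal sketch. This is not the authors' implementation; see the official repository for that. It assumes that each proposal already comes with a region embedding and each category name with a text embedding in the same space, and it replaces the real models with toy vectors. The function name `generate_pseudo_labels`, the `threshold` and `temperature` values, and the data layout are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate_pseudo_labels(proposals, category_embeddings,
                           threshold=0.8, temperature=0.01):
    """Assign a category to each class-agnostic proposal, keeping confident ones.

    proposals: list of (box, region_embedding) pairs (hypothetical layout).
    category_embeddings: dict mapping category name -> text embedding.
    threshold, temperature: illustrative values, not taken from the paper.
    """
    names = list(category_embeddings)
    labels = []
    for box, region_emb in proposals:
        sims = [cosine(region_emb, category_embeddings[n]) for n in names]
        probs = softmax(sims, temperature)
        best = max(range(len(names)), key=probs.__getitem__)
        # Discard regions the model cannot classify with confidence.
        if probs[best] >= threshold:
            labels.append({"box": box, "category": names[best],
                           "score": probs[best]})
    return labels


# Toy stand-ins for text embeddings of the category names.
category_embeddings = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}

# Toy stand-ins for class-agnostic proposals with region embeddings.
proposals = [
    ((10, 20, 110, 220), [0.9, 0.1, 0.0]),  # clearly closest to "cat"
    ((50, 60, 150, 160), [0.5, 0.5, 0.7]),  # equally close to both categories
]

pseudo_labels = generate_pseudo_labels(proposals, category_embeddings)
```

With these toy inputs, only the first box is kept and labeled "cat"; the second is split evenly between the two categories and falls below the threshold. Because the category list is just a dictionary of embeddings, swapping in a different label space requires no retraining in this sketch, which mirrors the abstract's point that regions can be categorized into any category a downstream task needs.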

Results

Task	Dataset	Metric	Value	Model
Object Detection	MSCOCO	AP 0.5	34.4	VL-PLM (RN50)
2D Object Detection	MSCOCO	AP 0.5	34.4	VL-PLM (RN50)
Open Vocabulary Object Detection	MSCOCO	AP 0.5	34.4	VL-PLM (RN50)
