Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Region-centric Image-Language Pretraining for Open-Vocabulary Detection

Dahun Kim, Anelia Angelova, Weicheng Kuo

2023-09-29 | Contrastive Learning | Open Vocabulary Object Detection | Object Detection

Abstract

We present a new open-vocabulary detection approach based on region-centric image-language pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we incorporate the detector architecture on top of the classification backbone, which better serves the region-level recognition needs of detection by enabling the detector heads to learn from large-scale image-text pairs. Using only standard contrastive loss and no pseudo-labeling, our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues. In addition, we propose a shifted-window learning approach upon window attention to make the backbone representation more robust, translation-invariant, and less biased by the window pattern. On the popular LVIS open-vocabulary detection benchmark, our approach sets a new state of the art of 37.6 mask APr using the common ViT-L backbone and public LAION dataset, and 40.5 mask APr using the DataComp-1B dataset, significantly outperforming the best existing approach by +3.7 mask APr at system level. On the COCO benchmark, we achieve very competitive 39.6 novel AP without pseudo labeling or weak supervision. In addition, we evaluate our approach on the transfer detection setup, where it demonstrates notable improvement over the baseline. Visualization reveals emerging object locality from the pretraining recipes compared to the baseline.
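The "standard contrastive loss" mentioned in the abstract is commonly written as a symmetric InfoNCE objective over a batch of paired image and text embeddings. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the use of plain arrays in place of detector-head and text-encoder outputs are all illustrative choices that do not come from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text InfoNCE loss for a batch of N matched pairs.

    image_emb, text_emb: arrays of shape (N, D); row i of each is a matched pair.
    temperature: hypothetical default, not a value taken from the paper.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = img @ txt.T / temperature  # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])    # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    print("matched pairs: ", contrastive_loss(emb, emb))
    print("shuffled pairs:", contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

In this sketch, correctly matched pairs yield a lower loss than shuffled pairs. In the approach described above, the image-side embeddings would come from the detector heads placed on top of the backbone during pretraining rather than from raw arrays.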

Results

Task | Dataset | Metric | Value | Model
Object Detection | LVIS v1.0 | AP novel-LVIS base training | 40.4 | DITO
Object Detection | LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 45.8 | DITO
Object Detection | MSCOCO | AP 0.5 | 46.1 | DITO
3D | LVIS v1.0 | AP novel-LVIS base training | 40.4 | DITO
3D | LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 45.8 | DITO
3D | MSCOCO | AP 0.5 | 46.1 | DITO
2D Classification | LVIS v1.0 | AP novel-LVIS base training | 40.4 | DITO
2D Classification | LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 45.8 | DITO
2D Classification | MSCOCO | AP 0.5 | 46.1 | DITO
2D Object Detection | LVIS v1.0 | AP novel-LVIS base training | 40.4 | DITO
2D Object Detection | LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 45.8 | DITO
2D Object Detection | MSCOCO | AP 0.5 | 46.1 | DITO
Open Vocabulary Object Detection | LVIS v1.0 | AP novel-LVIS base training | 40.4 | DITO
Open Vocabulary Object Detection | LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 45.8 | DITO
Open Vocabulary Object Detection | MSCOCO | AP 0.5 | 46.1 | DITO
16k | LVIS v1.0 | AP novel-LVIS base training | 40.4 | DITO
16k | LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 45.8 | DITO
16k | MSCOCO | AP 0.5 | 46.1 | DITO

Related Papers

SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)