TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/CoDet: Co-Occurrence Guided Region-Word Alignment for Open...

CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi

2023-10-25NeurIPS 2023 11Object DiscoveryOpen Vocabulary Object DetectionWord Alignmentobject-detectionObject Detection
PaperPDFCode(official)

Abstract

Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept shall exhibit high co-occurrence among the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performances and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.

Results

TaskDatasetMetricValueModel
Object DetectionLVIS v1.0AP novel-LVIS base training37CoDet (EVA02-L)
3DLVIS v1.0AP novel-LVIS base training37CoDet (EVA02-L)
2D ClassificationLVIS v1.0AP novel-LVIS base training37CoDet (EVA02-L)
2D Object DetectionLVIS v1.0AP novel-LVIS base training37CoDet (EVA02-L)
Open Vocabulary Object DetectionLVIS v1.0AP novel-LVIS base training37CoDet (EVA02-L)
16kLVIS v1.0AP novel-LVIS base training37CoDet (EVA02-L)

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains2025-07-17RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images2025-07-17Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection2025-07-17Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis2025-07-17Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios2025-07-16Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping2025-07-15ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge2025-07-08Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations2025-07-07