Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Retrieval-Augmented Open-Vocabulary Object Detection

Jooyeon Kim, Eulrang Cho, Sehyung Kim, Hyunwoo J. Kim

Published: 2024-04-08 | Venue: CVPR 2024
Tasks: Semantic Similarity, Semantic Textual Similarity, Large Language Model, Open Vocabulary Object Detection, Retrieval, Object Detection, Language Modelling

Abstract

Open-vocabulary object detection (OVD) has been studied with Vision-Language Models (VLMs) to detect novel objects beyond the pre-trained categories. Previous approaches improve generalization by expanding the detector's knowledge, using 'positive' pseudo-labels with additional 'class' names, e.g., sock, iPod, and alligator. To extend the previous methods in two aspects, we propose Retrieval-Augmented Losses and visual Features (RALF). Our method retrieves related 'negative' classes and augments the loss functions. In addition, visual features are augmented with 'verbalized concepts' of classes, e.g., worn on the feet, handheld music player, and sharp teeth. Specifically, RALF consists of two modules: Retrieval-Augmented Losses (RAL) and Retrieval-Augmented visual Features (RAF). RAL comprises two losses that reflect the semantic similarity with negative vocabularies. RAF augments visual features with the verbalized concepts from a large language model (LLM). Our experiments demonstrate the effectiveness of RALF on the COCO and LVIS benchmark datasets. We achieve improvements of up to 3.4 box AP$_{50}^{\text{N}}$ on novel categories of the COCO dataset and 3.6 mask AP$_{\text{r}}$ on the LVIS dataset. Code is available at https://github.com/mlvlab/RALF.
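The abstract's idea of retrieving related 'negative' classes by semantic similarity can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes class names have already been embedded by some text encoder, and it ranks the rest of the vocabulary by cosine similarity to a query class. The function names, the toy 3-d vectors, and the extra class names "crocodile" and "lizard" are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_negatives(
    query: str, embeddings: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Return the k vocabulary entries most similar to `query`, excluding it."""
    q = embeddings[query]
    scored = [
        (cosine(q, vec), name)
        for name, vec in embeddings.items()
        if name != query
    ]
    # Highest similarity first: these are the most confusable negatives.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Hand-made toy embeddings (assumption: a real system would use a text encoder).
toy_embeddings = {
    "alligator": [0.9, 0.1, 0.0],
    "crocodile": [0.85, 0.15, 0.05],
    "lizard": [0.6, 0.4, 0.1],
    "sock": [0.0, 0.2, 0.9],
    "iPod": [0.1, 0.9, 0.2],
}

hard_negatives = retrieve_negatives("alligator", toy_embeddings, k=2)
print(hard_negatives)
```

With these toy vectors, the two entries closest to "alligator" are the other reptile-like classes, which is the kind of semantically related negative vocabulary a similarity-aware loss could weight differently from unrelated classes such as "sock".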

Results

Task                              Dataset    Metric                       Value  Model
Object Detection                  LVIS v1.0  AP novel-LVIS base training  21.9   RALF
Object Detection                  MSCOCO     AP 0.5                       41.3   RALF
3D                                LVIS v1.0  AP novel-LVIS base training  21.9   RALF
3D                                MSCOCO     AP 0.5                       41.3   RALF
2D Classification                 LVIS v1.0  AP novel-LVIS base training  21.9   RALF
2D Classification                 MSCOCO     AP 0.5                       41.3   RALF
2D Object Detection               LVIS v1.0  AP novel-LVIS base training  21.9   RALF
2D Object Detection               MSCOCO     AP 0.5                       41.3   RALF
Open Vocabulary Object Detection  LVIS v1.0  AP novel-LVIS base training  21.9   RALF
Open Vocabulary Object Detection  MSCOCO     AP 0.5                       41.3   RALF
16k                               LVIS v1.0  AP novel-LVIS base training  21.9   RALF
16k                               MSCOCO     AP 0.5                       41.3   RALF

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)