Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Dahun Kim, Anelia Angelova, Weicheng Kuo

Published 2023-05-11 · CVPR 2023
Tasks: Zero-Shot Cross-Modal Retrieval · Image-text Retrieval · Text Retrieval · Contrastive Learning · Open Vocabulary Object Detection · Retrieval · Object Detection
Links: Paper · PDF · Code (official)

Abstract

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 34.1 $AP_r$ on LVIS, surpassing the best existing approach by +7.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
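
The two pretraining changes described above lend themselves to a short illustration. Below is a minimal PyTorch sketch, not the authors' implementation: the function names, the crop-sampling ranges, and the sigmoid-based focal formulation are assumptions made for illustration only.

```python
import math

import torch
import torch.nn.functional as F


def cropped_positional_embedding(pos_emb: torch.Tensor, grid: int,
                                 scale=(0.1, 1.0), ratio=(0.5, 2.0)) -> torch.Tensor:
    """Crop-and-resize the whole-image positional embedding.

    pos_emb: (1, grid*grid, dim) learned patch positional embedding (CLS excluded).
    Returns the same shape, taken from a random sub-region of the PE grid and
    bilinearly resized back to the full grid.
    """
    dim = pos_emb.shape[-1]
    pe = pos_emb.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)  # (1, dim, H, W)
    # Sample a crop box; these scale/ratio ranges are assumptions, not the
    # paper's exact values.
    area = grid * grid * torch.empty(1).uniform_(*scale).item()
    log_r = torch.empty(1).uniform_(math.log(ratio[0]), math.log(ratio[1])).item()
    aspect = math.exp(log_r)
    ch = min(grid, max(1, int(round(math.sqrt(area / aspect)))))
    cw = min(grid, max(1, int(round(math.sqrt(area * aspect)))))
    top = torch.randint(0, grid - ch + 1, (1,)).item()
    left = torch.randint(0, grid - cw + 1, (1,)).item()
    crop = pe[:, :, top:top + ch, left:left + cw]
    crop = F.interpolate(crop, size=(grid, grid), mode="bilinear", align_corners=False)
    return crop.permute(0, 2, 3, 1).reshape(1, grid * grid, dim)


def focal_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                           temperature: float = 0.07, gamma: float = 2.0) -> torch.Tensor:
    """Focal variant of the image-text contrastive loss over all batch pairs.

    img_emb, txt_emb: (B, dim), assumed L2-normalized; matching pairs share an index.
    Replaces softmax cross entropy with sigmoid BCE plus a focal modulating term.
    """
    logits = img_emb @ txt_emb.t() / temperature                   # (B, B)
    targets = torch.eye(logits.shape[0], device=logits.device)    # positives on diagonal
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1.0 - p) * (1.0 - targets)                # prob. of the true label
    return ((1.0 - p_t) ** gamma * ce).mean()
```

In this reading of the recipe, the cropped-and-resized embedding stands in for the whole-image positional embedding at each pretraining step, while detection finetuning uses the full embedding upsampled to the detector's larger grid, which is the region-level mismatch the recipe aims to reduce.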

Results

Task | Dataset | Metric | Value | Model
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 92.1 | RO-ViT
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 99.4 | RO-ViT
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.7 | RO-ViT
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 80.7 | RO-ViT
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 96.1 | RO-ViT
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 97.7 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 68.9 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 87.8 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 92.2 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 51.8 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 75 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 83 | RO-ViT
Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 32.1 | RO-ViT
Open Vocabulary Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 32.1 | RO-ViT
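
For reference, the R@K metric above counts a retrieval as correct when the ground-truth match appears among the top K results. A minimal sketch, assuming exactly one ground-truth gallery item per query (COCO's five captions per image make the real image-to-text protocol slightly more permissive):

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    # sim: (Q, G) similarity of each query to every gallery item;
    # the ground-truth match for query i is assumed to be gallery item i.
    topk = sim.topk(k, dim=1).indices                # (Q, k) retrieved indices
    gt = torch.arange(sim.shape[0]).unsqueeze(1)     # (Q, 1) ground-truth index
    return (topk == gt).any(dim=1).float().mean().item()
```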
