Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Dahun Kim, Anelia Angelova, Weicheng Kuo

Published 2023-05-11 · CVPR 2023
Tasks: Zero-Shot Cross-Modal Retrieval · Image-text Retrieval · Text Retrieval · Contrastive Learning · Open Vocabulary Object Detection · Retrieval · Object Detection
Links: Paper · PDF · Code (official)

Abstract

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 34.1 $AP_r$ on LVIS, surpassing the best existing approach by +7.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
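
The two pretraining changes described above lend themselves to a short illustration. Below is a minimal PyTorch sketch, not the authors' implementation: the function names, the crop-sampling ranges, and the sigmoid-based focal formulation are assumptions made for illustration only.

```python
import math

import torch
import torch.nn.functional as F


def cropped_positional_embedding(pos_emb: torch.Tensor, grid: int,
                                 scale=(0.1, 1.0), ratio=(0.5, 2.0)) -> torch.Tensor:
    """Crop-and-resize the whole-image positional embedding.

    pos_emb: (1, grid*grid, dim) learned patch positional embedding (CLS excluded).
    Returns the same shape, taken from a random sub-region of the PE grid and
    bilinearly resized back to the full grid.
    """
    dim = pos_emb.shape[-1]
    pe = pos_emb.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)  # (1, dim, H, W)
    # Sample a crop box; these scale/ratio ranges are assumptions, not the
    # paper's exact values.
    area = grid * grid * torch.empty(1).uniform_(*scale).item()
    log_r = torch.empty(1).uniform_(math.log(ratio[0]), math.log(ratio[1])).item()
    aspect = math.exp(log_r)
    ch = min(grid, max(1, int(round(math.sqrt(area / aspect)))))
    cw = min(grid, max(1, int(round(math.sqrt(area * aspect)))))
    top = torch.randint(0, grid - ch + 1, (1,)).item()
    left = torch.randint(0, grid - cw + 1, (1,)).item()
    crop = pe[:, :, top:top + ch, left:left + cw]
    crop = F.interpolate(crop, size=(grid, grid), mode="bilinear", align_corners=False)
    return crop.permute(0, 2, 3, 1).reshape(1, grid * grid, dim)


def focal_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                           temperature: float = 0.07, gamma: float = 2.0) -> torch.Tensor:
    """Focal variant of the image-text contrastive loss over all batch pairs.

    img_emb, txt_emb: (B, dim), assumed L2-normalized; matching pairs share an index.
    Replaces softmax cross entropy with sigmoid BCE plus a focal modulating term.
    """
    logits = img_emb @ txt_emb.t() / temperature                   # (B, B)
    targets = torch.eye(logits.shape[0], device=logits.device)    # positives on diagonal
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1.0 - p) * (1.0 - targets)                # prob. of the true label
    return ((1.0 - p_t) ** gamma * ce).mean()
```

In this reading of the recipe, the cropped-and-resized embedding stands in for the whole-image positional embedding at each pretraining step, while detection finetuning uses the full embedding upsampled to the detector's larger grid, which is the region-level mismatch the recipe aims to reduce.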

Results

Task | Dataset | Metric | Value | Model
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 92.1 | RO-ViT
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 99.4 | RO-ViT
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.7 | RO-ViT
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 80.7 | RO-ViT
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 96.1 | RO-ViT
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 97.7 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 68.9 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 87.8 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 92.2 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 51.8 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 75 | RO-ViT
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 83 | RO-ViT
Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 32.1 | RO-ViT
Open Vocabulary Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 32.1 | RO-ViT
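
For reference, the R@K metric above counts a retrieval as correct when the ground-truth match appears among the top K results. A minimal sketch, assuming exactly one ground-truth gallery item per query (COCO's five captions per image make the real image-to-text protocol slightly more permissive):

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    # sim: (Q, G) similarity of each query to every gallery item;
    # the ground-truth match for query i is assumed to be gallery item i.
    topk = sim.topk(k, dim=1).indices                # (Q, k) retrieved indices
    gt = torch.arange(sim.shape[0]).unsqueeze(1)     # (Q, 1) ground-truth index
    return (topk == gt).any(dim=1).float().mean().item()
```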
