VirTex: Learning Visual Representations from Textual Annotations

Karan Desai, Justin Johnson

2020-06-11CVPR 2021 1Image Classification Semantic Segmentation Image Captioning Instance Segmentation General Classification object-detection Object Detection

Paper PDF Code(official)Code Code

Abstract

The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.

Results

Task	Dataset	Metric	Value	Model
Image Captioning	COCO Captions	CIDER	94	Virtex (ResNet-101)
Image Captioning	COCO Captions	SPICE	18.5	Virtex (ResNet-101)
Object Detection	COCO minival	box AP	40.9	VirTex Mask R-CNN (ResNet-50-FPN)
3D	COCO minival	box AP	40.9	VirTex Mask R-CNN (ResNet-50-FPN)
Instance Segmentation	COCO test-dev	AP50	58.4	VirTex Mask R-CNN (ResNet-50-FPN)
Instance Segmentation	COCO test-dev	AP75	39.7	VirTex Mask R-CNN (ResNet-50-FPN)
Instance Segmentation	COCO test-dev	mask AP	36.9	VirTex Mask R-CNN (ResNet-50-FPN)
2D Classification	COCO minival	box AP	40.9	VirTex Mask R-CNN (ResNet-50-FPN)
2D Object Detection	COCO minival	box AP	40.9	VirTex Mask R-CNN (ResNet-50-FPN)
16k	COCO minival	box AP	40.9	VirTex Mask R-CNN (ResNet-50-FPN)

VirTex: Learning Visual Representations from Textual Annotations

Abstract

Results

Related Papers

VirTex: Learning Visual Representations from Textual Annotations

Abstract

Results

Related Papers