CoCa: Contrastive Captioners are Image-Text Foundation Models

Jiahui Yu, ZiRui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu

2022-05-04

Zero-Shot Cross-Modal Retrieval, Video Retrieval, Image Classification, Action Classification, Representation Learning, Visual Entailment, Image Captioning, Visual Reasoning, Zero-Shot Transfer Image Classification, Retrieval, Visual Question Answering (VQA)

Abstract

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
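The two training objectives described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the temperature of 0.07, the choice to sum the two contrastive directions, and the loss weights `w_con` and `w_cap` are all assumptions made for the example. The image encoder and the two halves of the text decoder are not modeled; their outputs are taken as given arrays.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) unimodal embeddings; row i of each is a matched pair.
    # The temperature value is an assumption, not taken from the source.
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature            # (B, B) pairwise similarities
    idx = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return image_to_text + text_to_image


def captioning_loss(token_logits, target_ids):
    # token_logits: (B, T, V) multimodal decoder outputs; target_ids: (B, T) next tokens.
    logp = log_softmax(token_logits, axis=-1)
    picked = np.take_along_axis(logp, target_ids[..., None], axis=-1)[..., 0]
    return -picked.mean()                          # mean token-level cross-entropy


def coca_style_loss(img_emb, txt_emb, token_logits, target_ids, w_con=1.0, w_cap=1.0):
    # Weighted sum of the two objectives; the weights are hypothetical defaults.
    return (w_con * contrastive_loss(img_emb, txt_emb)
            + w_cap * captioning_loss(token_logits, target_ids))
```

Because both losses are functions of outputs from a single forward pass, a sketch like this adds only the cost of a (B, B) similarity matrix on top of the captioning cross-entropy, which is consistent with the abstract's point about sharing one computational graph.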

Results

Task	Dataset	Metric	Value	Model
Video	MSR-VTT	text-to-video R@1	30.0	CoCa (zero-shot)
Video	MSR-VTT	text-to-video R@5	52.4	CoCa (zero-shot)
Video	MSR-VTT	text-to-video R@10	61.6	CoCa (zero-shot)
Video	MSR-VTT	video-to-text R@1	49.9	CoCa (zero-shot)
Video	MSR-VTT	video-to-text R@5	73.4	CoCa (zero-shot)
Video	MSR-VTT	video-to-text R@10	81.4	CoCa (zero-shot)
Video	Kinetics-700	Top-1 Accuracy	82.7	CoCa (finetuned)
Video	Kinetics-700	Top-1 Accuracy	81.1	CoCa (frozen)
Video	Moments in Time	Top-1 Accuracy	49.0	CoCa (finetuned)
Video	Moments in Time	Top-1 Accuracy	47.4	CoCa (frozen)
Video	Kinetics-400	Top-1 Accuracy	88.9	CoCa (finetuned)
Video	Kinetics-400	Top-1 Accuracy	88.0	CoCa (frozen)
Video	Kinetics-600	Top-1 Accuracy	89.4	CoCa (finetuned)
Video	Kinetics-600	Top-1 Accuracy	88.5	CoCa (frozen)
Visual Question Answering (VQA)	VQA v2 test-dev	Accuracy	82.3	CoCa
Visual Reasoning	NLVR2 Dev	Accuracy	86.1	CoCa
Visual Reasoning	NLVR2 Test	Accuracy	87.0	CoCa
Natural Language Inference	SNLI-VE val	Accuracy	87.0	CoCa
Natural Language Inference	SNLI-VE test	Accuracy	87.1	CoCa
Image Captioning	COCO Captions	BLEU-4	40.9	CoCa
Image Captioning	COCO Captions	CIDEr	143.6	CoCa
Image Captioning	COCO Captions	METEOR	33.9	CoCa
Image Captioning	COCO Captions	SPICE	24.7	CoCa
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@1	92.5	CoCa
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@5	99.5	CoCa
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@10	99.9	CoCa
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@1	80.4	CoCa
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@5	95.7	CoCa
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@10	97.7	CoCa
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@1	66.3	CoCa
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@5	86.2	CoCa
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@10	91.8	CoCa
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@1	51.2	CoCa
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@5	74.2	CoCa
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@10	82.0	CoCa
Image Classification	ObjectNet	Top-1 Accuracy	82.7	CoCa
Video Retrieval	MSR-VTT	text-to-video R@1	30.0	CoCa (zero-shot)
Video Retrieval	MSR-VTT	text-to-video R@5	52.4	CoCa (zero-shot)
Video Retrieval	MSR-VTT	text-to-video R@10	61.6	CoCa (zero-shot)
Video Retrieval	MSR-VTT	video-to-text R@1	49.9	CoCa (zero-shot)
Video Retrieval	MSR-VTT	video-to-text R@5	73.4	CoCa (zero-shot)
Video Retrieval	MSR-VTT	video-to-text R@10	81.4	CoCa (zero-shot)
Zero-Shot Transfer Image Classification	ImageNet V2	Accuracy (Private)	80.7	CoCa
Zero-Shot Transfer Image Classification	ImageNet-A	Accuracy (Private)	90.2	CoCa
Zero-Shot Transfer Image Classification	ImageNet	Accuracy (Private)	86.3	CoCa
Zero-Shot Transfer Image Classification	ImageNet-R	Accuracy	96.5	CoCa
Zero-Shot Transfer Image Classification	ObjectNet	Accuracy (Private)	82.7	CoCa
Zero-Shot Transfer Image Classification	ImageNet-Sketch	Accuracy (Private)	77.6	CoCa
Visual Question Answering	VQA v2 test-dev	Accuracy	82.3	CoCa
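
The R@1, R@5, and R@10 entries in the table are recall-at-K retrieval scores. A minimal sketch of how such a score can be computed from a similarity matrix is given below. It assumes exactly one correct candidate per query, located on the diagonal; this is a simplification, since Flickr30k and COCO pair each image with several captions. The function name `recall_at_k` is hypothetical.

```python
import numpy as np


def recall_at_k(similarity, k):
    # similarity[i, j]: score between query i and candidate j.
    # Assumption: the correct candidate for query i is candidate i.
    similarity = np.asarray(similarity, dtype=float)
    correct = np.diag(similarity)[:, None]
    # Rank of the correct candidate = number of candidates scoring strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    # Percentage of queries whose correct candidate is within the top k.
    return 100.0 * float((ranks < k).mean())


# Toy example: 3 text queries against 3 images.
sim = [[0.9, 0.1, 0.2],
       [0.8, 0.3, 0.5],
       [0.1, 0.2, 0.7]]
r_at_1 = recall_at_k(sim, 1)   # queries 0 and 2 rank their match first
r_at_3 = recall_at_k(sim, 3)   # every match is within the top 3
```

Under this definition, text-to-image recall uses text embeddings as queries and image embeddings as candidates, and image-to-text recall transposes the similarity matrix.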
