Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
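
The pre-training task and the zero-shot step described in the abstract can be illustrated with a small sketch. The abstract does not spell out the loss, so this assumes one common formulation: a symmetric cross-entropy over temperature-scaled cosine similarities between image and text embeddings, where the i-th image in a batch is paired with the i-th caption. The function names, the toy 2-dimensional embeddings, and the temperature value of 0.07 are illustrative assumptions, not the released implementation; real embeddings would come from trained image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Return an (n_images x n_texts) matrix of cosine similarities / temperature.

    The temperature value is an assumption made for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def cross_entropy(logits_row, target):
    """Numerically stable softmax cross-entropy for a single row of logits."""
    m = max(logits_row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum_exp - logits_row[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: each image should pick its own caption, and vice versa."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    n = len(logits)
    # Image -> text direction: rows of the similarity matrix.
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text -> image direction: columns of the similarity matrix.
    columns = [list(col) for col in zip(*logits)]
    loss_texts = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (loss_images + loss_texts) / 2


def zero_shot_predict(image_emb, class_text_embs, temperature=0.07):
    """Pick the class whose text embedding is most similar to the image embedding."""
    logits = similarity_logits([image_emb], class_text_embs, temperature)[0]
    return max(range(len(logits)), key=logits.__getitem__)


# Toy embeddings (hypothetical): two images and their two captions.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = contrastive_loss(images, matched_captions)
loss_swapped = contrastive_loss(images, swapped_captions)
predicted_class = zero_shot_predict([0.9, 0.1], matched_captions)
```

In this toy setup, correctly paired captions yield a lower loss than swapped ones, and zero-shot prediction reduces to a nearest-text lookup: the class names are embedded as text (for example via a prompt per class) and the image is assigned to the closest one, with no training on the downstream dataset.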

Results

Task	Dataset	Metric	Value	Model
Zero-Shot Learning	VOC-MLT	Average mAP	84.3	CLIP(ResNet-50)
Zero-Shot Learning	VOC-MLT	Average mAP	85.77	CLIP(ViT-B/16)
Zero-Shot Learning	COCO-MLT	Average mAP	56.19	CLIP(ResNet-50)
Zero-Shot Learning	COCO-MLT	Average mAP	60.17	CLIP(ViT-B/16)
Activity Recognition	RareAct	mWAP	40.7	CLIP
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@1	88	CLIP
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@10	99.4	CLIP
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@5	98.7	CLIP
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@1	68.7	CLIP
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@10	95.2	CLIP
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@5	90.6	CLIP
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@1	58.4	CLIP
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@10	88.1	CLIP
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@5	81.5	CLIP
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@1	37.8	CLIP
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@10	72.2	CLIP
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@5	62.4	CLIP
Object Detection	OVAD-Box benchmark	mean average precision	16.6	CLIP(ViT-B/16)
Image Classification	OmniBenchmark	Average Top-1 Accuracy	42.1	CLIP-RN50
Image Classification	ObjectNet	Top-1 Accuracy	72.3	CLIP
Image Classification	COCO-MLT	Average mAP	60.17	CLIP(ViT-B/16)
Image Classification	COCO-MLT	Average mAP	56.19	CLIP(ResNet-50)
Image Classification	VOC-MLT	Average mAP	85.77	CLIP(ViT-B/16)
Image Classification	VOC-MLT	Average mAP	84.3	CLIP(ResNet-50)
3D	OVAD-Box benchmark	mean average precision	16.6	CLIP(ViT-B/16)
Action Recognition	RareAct	mWAP	40.7	CLIP
Object Recognition	shape bias	shape bias	79.9	CLIP (ViT-B)
Few-Shot Image Classification	COCO-MLT	Average mAP	60.17	CLIP(ViT-B/16)
Few-Shot Image Classification	COCO-MLT	Average mAP	56.19	CLIP(ResNet-50)
Few-Shot Image Classification	VOC-MLT	Average mAP	85.77	CLIP(ViT-B/16)
Few-Shot Image Classification	VOC-MLT	Average mAP	84.3	CLIP(ResNet-50)
Meme Classification	Hateful Memes	ROC-AUC	0.661	CLIP (zero-shot)
Meme Classification	MultiOFF	Accuracy	62.4	CLIP
Meme Classification	MultiOFF	F1	48.1	CLIP
Meme Classification	Harm-P	Accuracy	80.6	CLIP
Meme Classification	Harm-P	F1	80.3	CLIP
Meme Classification	PrideMM	Accuracy	72.4	CLIP (fine-tuned)
Meme Classification	PrideMM	F1	72.3	CLIP (fine-tuned)
Generalized Few-Shot Classification	COCO-MLT	Average mAP	60.17	CLIP(ViT-B/16)
Generalized Few-Shot Classification	COCO-MLT	Average mAP	56.19	CLIP(ResNet-50)
Generalized Few-Shot Classification	VOC-MLT	Average mAP	85.77	CLIP(ViT-B/16)
Generalized Few-Shot Classification	VOC-MLT	Average mAP	84.3	CLIP(ResNet-50)
Long-tail Learning	COCO-MLT	Average mAP	60.17	CLIP(ViT-B/16)
Long-tail Learning	COCO-MLT	Average mAP	56.19	CLIP(ResNet-50)
Long-tail Learning	VOC-MLT	Average mAP	85.77	CLIP(ViT-B/16)
Long-tail Learning	VOC-MLT	Average mAP	84.3	CLIP(ResNet-50)
Generalized Few-Shot Learning	COCO-MLT	Average mAP	60.17	CLIP(ViT-B/16)
Generalized Few-Shot Learning	COCO-MLT	Average mAP	56.19	CLIP(ResNet-50)
Generalized Few-Shot Learning	VOC-MLT	Average mAP	85.77	CLIP(ViT-B/16)
Generalized Few-Shot Learning	VOC-MLT	Average mAP	84.3	CLIP(ResNet-50)
Zero-Shot Transfer Image Classification	ImageNet V2	Accuracy (Private)	70.1	CLIP
Zero-Shot Transfer Image Classification	ImageNet-A	Accuracy (Private)	77.2	CLIP
Zero-Shot Transfer Image Classification	ImageNet	Accuracy (Private)	76.2	CLIP (ViT-L/14-336px)
Zero-Shot Transfer Image Classification	ImageNet	Accuracy (Private)	59.6	CLIP (ResNet50)
Zero-Shot Transfer Image Classification	ImageNet	Accuracy (Public)	31.3	CLIP
Zero-Shot Transfer Image Classification	ImageNet-R	Accuracy	88.9	CLIP
Zero-Shot Transfer Image Classification	SUN	Accuracy	58.5	CLIP
Zero-Shot Transfer Image Classification	ObjectNet	Accuracy (Private)	72.3	CLIP
Zero-Shot Transfer Image Classification	aYahoo	Accuracy	98.4	CLIP
2D Classification	OVAD-Box benchmark	mean average precision	16.6	CLIP(ViT-B/16)
2D Object Detection	OVAD-Box benchmark	mean average precision	16.6	CLIP(ViT-B/16)
Object Categorization	GRIT	Categorization (ablation)	48.1	CLIP
Prompt Engineering	ImageNet-R	Top-1 accuracy %	73.96	CLIP
Prompt Engineering	Stanford Cars	Harmonic mean	68.65	CLIP
Prompt Engineering	Oxford 102 Flower	Harmonic mean	74.83	CLIP
Prompt Engineering	EuroSAT	Harmonic mean	60.03	CLIP
Prompt Engineering	Oxford-IIIT Pet Dataset	Harmonic mean	94.12	CLIP
Prompt Engineering	ImageNet-S	Top-1 accuracy %	46.15	CLIP
Prompt Engineering	DTD	Harmonic mean	56.37	CLIP
Prompt Engineering	UCF101	Harmonic mean	73.85	CLIP
Prompt Engineering	Caltech-101	Harmonic mean	95.4	CLIP
Prompt Engineering	ImageNet	Harmonic mean	70.22	CLIP
Prompt Engineering	FGVC-Aircraft	Harmonic mean	31.09	CLIP
Prompt Engineering	SUN397	Harmonic mean	72.23	CLIP
Prompt Engineering	ImageNet-A	Top-1 accuracy %	47.77	CLIP
Prompt Engineering	ImageNet V2	Top-1 accuracy %	60.83	CLIP
Open Vocabulary Object Detection	OVAD-Box benchmark	mean average precision	16.6	CLIP(ViT-B/16)
Image-to-Text Retrieval	COCO (Common Objects in Context)	Recall@1	58.4	CLIP (zero-shot)
Image-to-Text Retrieval	COCO (Common Objects in Context)	Recall@10	88.1	CLIP (zero-shot)
Image-to-Text Retrieval	COCO (Common Objects in Context)	Recall@5	81.5	CLIP (zero-shot)
Text-based Person Retrieval with Noisy Correspondence	ICFG-PEDES	Rank-1	55.25	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	ICFG-PEDES	Rank-5	74.76	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	ICFG-PEDES	Rank-10	81.32	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	ICFG-PEDES	mAP	31.09	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	ICFG-PEDES	mINP	4.94	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	RSTPReid	Rank-1	54.45	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	RSTPReid	Rank-5	77.8	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	RSTPReid	Rank-10	86.7	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	RSTPReid	mAP	42.58	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	RSTPReid	mINP	21.38	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	CUHK-PEDES	Rank-1	66.41	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	CUHK-PEDES	Rank-5	85.15	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	CUHK-PEDES	Rank-10	90.89	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	CUHK-PEDES	mAP	59.36	CLIP-C
Text-based Person Retrieval with Noisy Correspondence	CUHK-PEDES	mINP	43.02	CLIP-C
16k	OVAD-Box benchmark	mean average precision	16.6	CLIP(ViT-B/16)
