Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

Published 2021-02-26

Tasks: Zero-Shot Cross-Modal Retrieval, Benchmarking, Text Generation, Hateful Meme Classification, Text-based Person Retrieval with Noisy Correspondence, Open Vocabulary Attribute Detection, Image Classification, Object Categorization, Long-tail Learning, Geo-localization, Natural Language Understanding, Prompt Engineering, Object Recognition, Few-Shot Image Classification, Preference Mapping, Zero-shot Text-to-Image Retrieval, Visual Reasoning, Zero-Shot Transfer Image Classification, Image-to-Text Retrieval, Action Recognition, Meme Classification, Temporal Relation Extraction, Zero-Shot Learning, Out-of-Distribution Generalization, Semi-Supervised Image Classification
Links: Paper · PDF · Code (one official implementation and many community implementations)

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
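The pre-training task described above — predicting which caption goes with which image — is commonly implemented as a symmetric cross-entropy loss over a batch of pairwise cosine similarities. The following is a minimal NumPy sketch of that objective, not OpenAI's actual implementation (the paper's version also learns the temperature and projection layers); the function name and fixed temperature are illustrative choices:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched (image, text) pairs.

    image_emb, text_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(logits))      # the correct pair sits on the diagonal

    def cross_entropy(l, y):
        # numerically stable softmax cross-entropy, averaged over the batch
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each image toward its own caption and pushes it away from the other N−1 captions in the batch, which is what makes the learned embedding space usable for zero-shot classification: class names rendered as captions become the classifier weights.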

Results

Task | Dataset | Metric | Value | Model
Zero-Shot Learning | VOC-MLT | Average mAP | 84.3 | CLIP (ResNet-50)
Zero-Shot Learning | VOC-MLT | Average mAP | 85.77 | CLIP (ViT-B/16)
Zero-Shot Learning | COCO-MLT | Average mAP | 56.19 | ResNet-50
Zero-Shot Learning | COCO-MLT | Average mAP | 60.17 | ViT-B/16
Activity Recognition | RareAct | mWAP | 40.7 | CLIP
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 88 | CLIP
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 98.7 | CLIP
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.4 | CLIP
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 68.7 | CLIP
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 90.6 | CLIP
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 95.2 | CLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 58.4 | CLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 81.5 | CLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 88.1 | CLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 37.8 | CLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 62.4 | CLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 72.2 | CLIP
Object Detection | OVAD-Box benchmark | mean average precision | 16.6 | CLIP ViT-B/16
Image Classification | OmniBenchmark | Average Top-1 Accuracy | 42.1 | CLIP-RN50
Image Classification | ObjectNet | Top-1 Accuracy | 72.3 | CLIP
Image Classification | COCO-MLT | Average mAP | 60.17 | CLIP (ViT-B/16)
Image Classification | COCO-MLT | Average mAP | 56.19 | CLIP (ResNet-50)
Image Classification | VOC-MLT | Average mAP | 85.77 | CLIP (ViT-B/16)
Image Classification | VOC-MLT | Average mAP | 84.3 | CLIP (ResNet-50)
3D | OVAD-Box benchmark | mean average precision | 16.6 | CLIP ViT-B/16
Action Recognition | RareAct | mWAP | 40.7 | CLIP
Object Recognition | shape bias | shape bias | 79.9 | CLIP (ViT-B)
Few-Shot Image Classification | COCO-MLT | Average mAP | 60.17 | CLIP (ViT-B/16)
Few-Shot Image Classification | COCO-MLT | Average mAP | 56.19 | CLIP (ResNet-50)
Few-Shot Image Classification | VOC-MLT | Average mAP | 85.77 | CLIP (ViT-B/16)
Few-Shot Image Classification | VOC-MLT | Average mAP | 84.3 | CLIP (ResNet-50)
Meme Classification | Hateful Memes | ROC-AUC | 0.661 | CLIP (zero-shot)
Meme Classification | MultiOFF | Accuracy | 62.4 | CLIP
Meme Classification | MultiOFF | F1 | 48.1 | CLIP
Meme Classification | Harm-P | Accuracy | 80.6 | CLIP
Meme Classification | Harm-P | F1 | 80.3 | CLIP
Meme Classification | PrideMM | Accuracy | 72.4 | CLIP (fine-tuned)
Meme Classification | PrideMM | F1 | 72.3 | CLIP (fine-tuned)
Generalized Few-Shot Classification | COCO-MLT | Average mAP | 60.17 | CLIP (ViT-B/16)
Generalized Few-Shot Classification | COCO-MLT | Average mAP | 56.19 | CLIP (ResNet-50)
Generalized Few-Shot Classification | VOC-MLT | Average mAP | 85.77 | CLIP (ViT-B/16)
Generalized Few-Shot Classification | VOC-MLT | Average mAP | 84.3 | CLIP (ResNet-50)
Long-tail Learning | COCO-MLT | Average mAP | 60.17 | CLIP (ViT-B/16)
Long-tail Learning | COCO-MLT | Average mAP | 56.19 | CLIP (ResNet-50)
Long-tail Learning | VOC-MLT | Average mAP | 85.77 | CLIP (ViT-B/16)
Long-tail Learning | VOC-MLT | Average mAP | 84.3 | CLIP (ResNet-50)
Generalized Few-Shot Learning | COCO-MLT | Average mAP | 60.17 | CLIP (ViT-B/16)
Generalized Few-Shot Learning | COCO-MLT | Average mAP | 56.19 | CLIP (ResNet-50)
Generalized Few-Shot Learning | VOC-MLT | Average mAP | 85.77 | CLIP (ViT-B/16)
Generalized Few-Shot Learning | VOC-MLT | Average mAP | 84.3 | CLIP (ResNet-50)
Zero-Shot Transfer Image Classification | ImageNet V2 | Accuracy (Private) | 70.1 | CLIP
Zero-Shot Transfer Image Classification | ImageNet-A | Accuracy (Private) | 77.2 | CLIP
Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 76.2 | CLIP (ViT-L/14-336px)
Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 59.6 | CLIP (ResNet50)
Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Public) | 31.3 | CLIP
Zero-Shot Transfer Image Classification | ImageNet-R | Accuracy | 88.9 | CLIP
Zero-Shot Transfer Image Classification | SUN | Accuracy | 58.5 | CLIP
Zero-Shot Transfer Image Classification | ObjectNet | Accuracy (Private) | 72.3 | CLIP
Zero-Shot Transfer Image Classification | aYahoo | Accuracy | 98.4 | CLIP
2D Classification | OVAD-Box benchmark | mean average precision | 16.6 | CLIP ViT-B/16
2D Object Detection | OVAD-Box benchmark | mean average precision | 16.6 | CLIP ViT-B/16
Object Categorization | GRIT | Categorization (ablation) | 48.1 | CLIP
Prompt Engineering | ImageNet-R | Top-1 accuracy % | 73.96 | CLIP
Prompt Engineering | Stanford Cars | Harmonic mean | 68.65 | CLIP
Prompt Engineering | Oxford 102 Flower | Harmonic mean | 74.83 | CLIP
Prompt Engineering | EuroSAT | Harmonic mean | 60.03 | CLIP
Prompt Engineering | Oxford-IIIT Pet Dataset | Harmonic mean | 94.12 | CLIP
Prompt Engineering | ImageNet-S | Top-1 accuracy % | 46.15 | CLIP
Prompt Engineering | DTD | Harmonic mean | 56.37 | CLIP
Prompt Engineering | UCF101 | Harmonic mean | 73.85 | CLIP
Prompt Engineering | Caltech-101 | Harmonic mean | 95.4 | CLIP
Prompt Engineering | ImageNet | Harmonic mean | 70.22 | CLIP
Prompt Engineering | FGVC-Aircraft | Harmonic mean | 31.09 | CLIP
Prompt Engineering | SUN397 | Harmonic mean | 72.23 | CLIP
Prompt Engineering | ImageNet-A | Top-1 accuracy % | 47.77 | CLIP
Prompt Engineering | ImageNet V2 | Top-1 accuracy % | 60.83 | CLIP
Open Vocabulary Object Detection | OVAD-Box benchmark | mean average precision | 16.6 | CLIP ViT-B/16
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@1 | 58.4 | CLIP (zero-shot)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@5 | 81.5 | CLIP (zero-shot)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@10 | 88.1 | CLIP (zero-shot)
Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | Rank-1 | 55.25 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | Rank-5 | 74.76 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | Rank-10 | 81.32 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | mAP | 31.09 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | mINP | 4.94 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | RSTPReid | Rank-1 | 54.45 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | RSTPReid | Rank-5 | 77.8 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | RSTPReid | Rank-10 | 86.7 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | RSTPReid | mAP | 42.58 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | RSTPReid | mINP | 21.38 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | Rank-1 | 66.41 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | Rank-5 | 85.15 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | Rank-10 | 90.89 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | mAP | 59.36 | CLIP-C
Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | mINP | 43.02 | CLIP-C
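Many of the retrieval rows above report Recall@K (R@1/R@5/R@10, or Rank-K): the percentage of queries whose correct match appears among the K highest-scoring gallery items. A minimal sketch of that metric, assuming a precomputed query-by-gallery similarity matrix in which query i's single true match sits at gallery index i:

```python
import numpy as np

def recall_at_k(similarity, k):
    """Recall@K for retrieval, given an (N_query, N_gallery) similarity matrix
    where query i's single correct match is gallery item i.

    Returns the percentage of queries whose true match ranks in the top K.
    """
    # indices of the k most similar gallery items for each query
    topk = np.argsort(-similarity, axis=1)[:, :k]
    # a query is a hit if its own index appears among its top-k results
    hits = (topk == np.arange(len(similarity))[:, None]).any(axis=1)
    return 100.0 * hits.mean()
```

Recall@K is monotonically non-decreasing in K, which is why the R@10 figures in the table are always at least as high as the corresponding R@1 figures.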

Related Papers

Visual Place Recognition for Large-Scale UAV Applications (2025-07-20)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Training Transformers with Enforced Lipschitz Constants (2025-07-17)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)