Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CLIP

Contrastive Language-Image Pre-training

Computer Vision · Introduced 2021 · 3094 papers
Source Paper: Learning Transferable Visual Models From Natural Language Supervision

Description

Contrastive Language-Image Pre-training (CLIP), a simplified version of ConVIRT trained from scratch, is an efficient method for learning image representations from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes.
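
The zero-shot procedure can be illustrated with a short sketch using OpenAI's open-source clip package (pip install from github.com/openai/CLIP). The class names, prompt template, and the file example.jpg below are placeholders, not part of the original method description.

```python
import torch
import clip  # OpenAI's CLIP reference implementation
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Synthesize a zero-shot classifier by embedding the target dataset's class names.
class_names = ["airplane", "automobile", "bird", "cat", "dog"]  # placeholder classes
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Normalize so the dot product equals cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T  # scaled similarities

probs = logits.softmax(dim=-1)
print("Predicted class:", class_names[probs.argmax().item()])
```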

For pre-training, CLIP is trained to predict which of the $N \times N$ possible (image, text) pairings across a batch actually occurred. CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. A symmetric cross entropy loss is optimized over these similarity scores.
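
A minimal PyTorch sketch of this symmetric objective is shown below. It assumes the image and text embeddings for a batch have already been computed, and uses a fixed temperature for simplicity, whereas the paper learns the temperature as a parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric cross-entropy over the N x N similarity matrix of a batch."""
    # L2-normalize so the dot product equals cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # logits[i, j] = similarity between image i and text j.
    logits = image_embeds @ text_embeds.T / temperature

    # The N correct pairings lie on the diagonal; the N^2 - N off-diagonal
    # entries are the incorrect pairings pushed apart by the loss.
    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)

    loss_images = F.cross_entropy(logits, targets)      # match each image to its text
    loss_texts = F.cross_entropy(logits.T, targets)     # match each text to its image
    return (loss_images + loss_texts) / 2
```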

Image credit: Learning Transferable Visual Models From Natural Language Supervision

Papers Using This Method

CATVis: Context-Aware Thought Visualization (2025-07-15)
Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (2025-07-14)
DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation (2025-07-14)
Test-Time Canonicalization by Foundation Models for Robust Perception (2025-07-14)
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift (2025-07-12)
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding (2025-07-09)
RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models (2025-07-08)
CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions (2025-07-08)
Integrated Structural Prompt Learning for Vision-Language Models (2025-07-08)
Semi-Supervised Defect Detection via Conditional Diffusion and CLIP-Guided Noise Filtering (2025-07-08)
CLIP-Guided Backdoor Defense through Entropy-Based Poisoned Dataset Separation (2025-07-07)
pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models (2025-07-07)
An analysis of vision-language models for fabric retrieval (2025-07-07)
Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor (2025-07-04)
Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach (2025-07-04)
Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization (2025-07-03)
VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding (2025-06-28)
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs (2025-06-27)
Multimodal Prompt Alignment for Facial Expression Recognition (2025-06-26)