Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CLIP

Contrastive Language-Image Pre-training

Computer Vision · Introduced 2021 · 3094 papers
Source Paper: Learning Transferable Visual Models From Natural Language Supervision

Description

Contrastive Language-Image Pre-training (CLIP), a simplified version of ConVIRT trained from scratch, is an efficient method for learning image representations from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes.
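
The zero-shot procedure can be illustrated with a short sketch using OpenAI's open-source clip package (pip install from github.com/openai/CLIP). The class names, prompt template, and the file example.jpg below are placeholders, not part of the original method description.

```python
import torch
import clip  # OpenAI's CLIP reference implementation
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Synthesize a zero-shot classifier by embedding the target dataset's class names.
class_names = ["airplane", "automobile", "bird", "cat", "dog"]  # placeholder classes
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Normalize so the dot product equals cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T  # scaled similarities

probs = logits.softmax(dim=-1)
print("Predicted class:", class_names[probs.argmax().item()])
```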

For pre-training, CLIP is trained to predict which of the $N \times N$ possible (image, text) pairings across a batch actually occurred. CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. A symmetric cross entropy loss is optimized over these similarity scores.
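
A minimal PyTorch sketch of this symmetric objective is shown below. It assumes the image and text embeddings for a batch have already been computed, and uses a fixed temperature for simplicity, whereas the paper learns the temperature as a parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric cross-entropy over the N x N similarity matrix of a batch."""
    # L2-normalize so the dot product equals cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # logits[i, j] = similarity between image i and text j.
    logits = image_embeds @ text_embeds.T / temperature

    # The N correct pairings lie on the diagonal; the N^2 - N off-diagonal
    # entries are the incorrect pairings pushed apart by the loss.
    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)

    loss_images = F.cross_entropy(logits, targets)      # match each image to its text
    loss_texts = F.cross_entropy(logits.T, targets)     # match each text to its image
    return (loss_images + loss_texts) / 2
```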

Image credit: Learning Transferable Visual Models From Natural Language Supervision

Papers Using This Method

CATVis: Context-Aware Thought Visualization (2025-07-15)
Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (2025-07-14)
DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation (2025-07-14)
Test-Time Canonicalization by Foundation Models for Robust Perception (2025-07-14)
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift (2025-07-12)
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding (2025-07-09)
RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models (2025-07-08)
CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions (2025-07-08)
Integrated Structural Prompt Learning for Vision-Language Models (2025-07-08)
Semi-Supervised Defect Detection via Conditional Diffusion and CLIP-Guided Noise Filtering (2025-07-08)
CLIP-Guided Backdoor Defense through Entropy-Based Poisoned Dataset Separation (2025-07-07)
pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models (2025-07-07)
An analysis of vision-language models for fabric retrieval (2025-07-07)
Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor (2025-07-04)
Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach (2025-07-04)
Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization (2025-07-03)
VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding (2025-06-28)
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs (2025-06-27)
Multimodal Prompt Alignment for Facial Expression Recognition (2025-06-26)