Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

ALIGN

Computer Vision · Introduced 2000 · 5527 papers

Description

In ALIGN, visual and language representations are jointly trained on noisy image alt-text data. The image and text encoders are learned with a contrastive loss, formulated as a normalized softmax, that pulls the embeddings of matched image-text pairs together and pushes those of non-matched pairs apart. The resulting aligned representations can be transferred to vision-only or vision-language tasks. Without any fine-tuning, ALIGN supports zero-shot visual classification and cross-modal search, including image-to-text search, text-to-image search, and search with joint image+text queries.
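The contrastive objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the fixed `temperature` value, and the choice to sum the image-to-text and text-to-image terms are assumptions made for the example, and the encoders that would produce the embeddings are omitted entirely.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each embedding to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Normalized-softmax contrastive loss over a batch of N matched pairs.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair;
    every other combination in the batch is treated as a non-matched pair.
    The temperature is a fixed, hypothetical value chosen for illustration.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    # logits[i, j] is the scaled similarity between image i and text j.
    logits = img @ txt.T / temperature
    diag = np.arange(logits.shape[0])
    # Image-to-text term: softmax over texts for each image (rows).
    i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image term: softmax over images for each text (columns).
    t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return i2t + t2i


def zero_shot_classify(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)
```

The same similarity matrix serves both training and inference in this sketch: the loss rewards a large diagonal during training, and zero-shot classification or cross-modal search reduces to picking the highest-scoring entry in a row or column.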

Papers Using This Method

SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)
Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers (2025-07-15)
Toward Improving fNIRS Classification: A Study on Activation Functions in Deep Neural Architectures (2025-07-15)
Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning (2025-07-14)
SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples (2025-07-10)
InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior (2025-07-09)
Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset (2025-07-09)
Explainable Artificial Intelligence in Biomedical Image Analysis: A Comprehensive Survey (2025-07-09)
LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models (2025-07-08)
Vers un cadre ontologique pour la gestion des compétences : à des fins de formation, de recrutement, de métier, ou de recherches associées (2025-07-08)
ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)
ADMC: Attention-based Diffusion Model for Missing Modalities Feature Completion (2025-07-08)
Neural-Driven Image Editing (2025-07-07)
Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning (2025-07-07)
CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step (2025-07-06)
Rectifying Adversarial Sample with Low Entropy Prior for Test-Time Defense (2025-07-04)
Hita: Holistic Tokenizer for Autoregressive Image Generation (2025-07-03)
Adopting a human developmental visual diet yields robust, shape-based AI vision (2025-07-03)
De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks (2025-07-03)