Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

ALIGN

Computer Vision | Introduced 2000 | 5527 papers

Description

In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pulls the embeddings of matched image-text pairs together and pushes those of non-matched pairs apart, so that the two encoders learn aligned visual and language representations. The representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search, including image-to-text search, text-to-image search, and even search with joint image+text queries.
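The contrastive objective and the zero-shot use described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the fixed `temperature` value, and the choice to sum an image-to-text and a text-to-image term are assumptions made for the example, and embeddings are assumed to be non-zero lists of floats produced by some encoders that are not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(logits, target):
    """Log of the softmax probability at index `target`, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[target] - log_sum


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Normalized-softmax contrastive loss over a batch of paired embeddings.

    Row k of image_embs is assumed to match row k of text_embs; every other
    pairing in the batch is treated as a non-match. The temperature is a
    hypothetical fixed value here.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and text j
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text term: each image should select its own text
    i2t = -sum(log_softmax_at(sim[k], k) for k in range(n)) / n
    # Text-to-image term: each text should select its own image
    t2i = -sum(
        log_softmax_at([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return i2t + t2i


def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class text most similar to the image."""
    img = l2_normalize(image_emb)
    scores = [dot(img, l2_normalize(t)) for t in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

With embeddings that already agree pair by pair, `contrastive_loss` is close to zero, and it grows when the pairing is scrambled. `zero_shot_classify` needs no training step: the class names are embedded once as text and the image is assigned to its nearest neighbour, which is the same similarity lookup that underlies image-to-text and text-to-image search.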

Papers Using This Method

SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)
Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers (2025-07-15)
Toward Improving fNIRS Classification: A Study on Activation Functions in Deep Neural Architectures (2025-07-15)
Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning (2025-07-14)
SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples (2025-07-10)
InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior (2025-07-09)
Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset (2025-07-09)
Explainable Artificial Intelligence in Biomedical Image Analysis: A Comprehensive Survey (2025-07-09)
LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models (2025-07-08)
Vers un cadre ontologique pour la gestion des compétences : à des fins de formation, de recrutement, de métier, ou de recherches associées (2025-07-08)
ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)
ADMC: Attention-based Diffusion Model for Missing Modalities Feature Completion (2025-07-08)
Neural-Driven Image Editing (2025-07-07)
Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning (2025-07-07)
CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step (2025-07-06)
Rectifying Adversarial Sample with Low Entropy Prior for Test-Time Defense (2025-07-04)
Hita: Holistic Tokenizer for Autoregressive Image Generation (2025-07-03)
Adopting a human developmental visual diet yields robust, shape-based AI vision (2025-07-03)
De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks (2025-07-03)