Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning to Prompt for Vision-Language Models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu

Published: 2021-09-02
Tasks: Representation Learning · Prompt Engineering · Domain Generalization · Few-shot Age Estimation
Links: Paper · PDF · Code (official and community implementations)

Abstract

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming -- one needs to spend a significant amount of time on words tuning since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while the entire pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts with a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
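The core idea of CoOp described above, replacing a hand-crafted prompt's context words with learnable vectors while the pre-trained encoders stay frozen, can be sketched as follows. This is a minimal illustration, not the official implementation: the tiny dimensions, the random stand-in embeddings, and the mean-pooling `text_encoder` are all placeholder assumptions standing in for CLIP's real tokenizer and transformer encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 8        # embedding dimension (tiny, for illustration only)
N_CTX = 4      # number of learnable context tokens (the paper uses up to 16)
CLASSES = ["cat", "dog", "car"]

# Learnable context vectors (unified context: shared across all classes).
# In CoOp these are the ONLY trained parameters, optimized by backprop;
# here they are just randomly initialized.
ctx = rng.normal(scale=0.02, size=(N_CTX, DIM))

# Frozen class-name token embeddings (stand-ins for CLIP's token embeddings).
name_emb = {c: rng.normal(size=(1, DIM)) for c in CLASSES}

def text_encoder(tokens):
    """Stand-in for CLIP's frozen text encoder: mean-pool the token sequence."""
    return tokens.mean(axis=0)

def encode_prompts(ctx):
    """Build one prompt per class, [ctx_1 ... ctx_M, CLASS], and encode it."""
    feats = np.stack(
        [text_encoder(np.vstack([ctx, name_emb[c]])) for c in CLASSES]
    )
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def logits(image_feat, ctx, temperature=0.01):
    """Cosine similarity between the image feature and each class prompt."""
    img = image_feat / np.linalg.norm(image_feat)
    return encode_prompts(ctx) @ img / temperature

# Stand-in for a frozen CLIP image-encoder output.
image_feat = rng.normal(size=DIM)
scores = logits(image_feat, ctx)   # one score per class, shape (3,)
```

The class-specific context (CSC) variant mentioned in the abstract would simply give `ctx` an extra leading class axis, so each class learns its own context tokens; everything else is unchanged.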

Results

All results are for CoOp on MORPH Album2. The same MAE values are reported under six task tags (Facial Recognition and Modelling, Face Reconstruction, 3D, 3D Face Modelling, 3D Face Reconstruction, Age Estimation):

Metric        | Value | Model
MAE           | 5.09  | CoOp
MAE (2 shot)  | 4.5   | CoOp
MAE (4 shot)  | 3.81  | CoOp
MAE (8 shot)  | 3.57  | CoOp
MAE (16 shot) | 3.23  | CoOp

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- Leveraging Language Prior for Infrared Small Target Detection (2025-07-17)
- Emotional Support with LLM-based Empathetic Dialogue Generation (2025-07-17)
- Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling (2025-07-17)