Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Visual Prompt Tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, Ser-Nam Lim

2022-03-23 · Tasks: Image Classification, Long-tail Learning, Prompt Engineering, Visual Prompt Tuning

Links: Paper · PDF · Code (official implementation and community reimplementations)

Abstract

The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.
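The idea sketched in the abstract — prepend a handful of trainable prompt tokens to the input sequence and freeze everything else — can be illustrated with a minimal PyTorch sketch. This is not the paper's official implementation; the stand-in backbone (`nn.TransformerEncoder` instead of a pre-trained ViT), the prompt count, and the mean-pooling head are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class VPTShallow(nn.Module):
    """Illustrative VPT-Shallow: learnable prompt tokens are concatenated
    with the patch embeddings; only the prompts and the classification
    head are trained, the backbone stays frozen."""

    def __init__(self, backbone, embed_dim=768, num_prompts=8, num_classes=100):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the backbone entirely
        # The prompts are the only new parameters in the input space.
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):  # tokens: (B, N, D) patch embeddings
        b = tokens.shape[0]
        x = torch.cat([self.prompts.expand(b, -1, -1), tokens], dim=1)
        x = self.backbone(x)
        return self.head(x.mean(dim=1))  # pool over tokens, then classify

# A tiny stand-in backbone, just to show the trainable-parameter fraction.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
model = VPTShallow(backbone)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.3%}")
```

Even against this shallow two-layer stand-in, the trainable fraction (prompts plus head) comes out well under 1%, consistent with the abstract's claim; VPT-Deep differs in that fresh prompt tokens are inserted at every Transformer layer rather than only at the input.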

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Classification | CIFAR-100-LT (ρ=50) | Error Rate | 15.2 | VPT |
| Image Classification | CIFAR-100-LT (ρ=10) | Error Rate | 10.4 | VPT |
| Image Classification | CIFAR-100-LT (ρ=100) | Error Rate | 19 | VPT |
| Few-Shot Image Classification | CIFAR-100-LT (ρ=50) | Error Rate | 15.2 | VPT |
| Few-Shot Image Classification | CIFAR-100-LT (ρ=10) | Error Rate | 10.4 | VPT |
| Few-Shot Image Classification | CIFAR-100-LT (ρ=100) | Error Rate | 19 | VPT |
| Generalized Few-Shot Classification | CIFAR-100-LT (ρ=50) | Error Rate | 15.2 | VPT |
| Generalized Few-Shot Classification | CIFAR-100-LT (ρ=10) | Error Rate | 10.4 | VPT |
| Generalized Few-Shot Classification | CIFAR-100-LT (ρ=100) | Error Rate | 19 | VPT |
| Long-tail Learning | CIFAR-100-LT (ρ=50) | Error Rate | 15.2 | VPT |
| Long-tail Learning | CIFAR-100-LT (ρ=10) | Error Rate | 10.4 | VPT |
| Long-tail Learning | CIFAR-100-LT (ρ=100) | Error Rate | 19 | VPT |
| Generalized Few-Shot Learning | CIFAR-100-LT (ρ=50) | Error Rate | 15.2 | VPT |
| Generalized Few-Shot Learning | CIFAR-100-LT (ρ=10) | Error Rate | 10.4 | VPT |
| Generalized Few-Shot Learning | CIFAR-100-LT (ρ=100) | Error Rate | 19 | VPT |
| Prompt Engineering | ImageNet-21k | Accuracy | 24.8 | VPT |
| Visual Prompt Tuning | FGVC | Mean Accuracy | 83.12 | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | FGVC | Mean Accuracy | 79.26 | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | FGVC | Mean Accuracy | 72.02 | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | FGVC | Mean Accuracy | 57.84 | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Structured <8>) | Mean Accuracy | 42.38 | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Structured <8>) | Mean Accuracy | 37.55 | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Structured <8>) | Mean Accuracy | 27.5 | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Structured <8>) | Mean Accuracy | 26.57 | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Natural <7>) | Mean Accuracy | 70.27 | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Natural <7>) | Mean Accuracy | 67.34 | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Natural <7>) | Mean Accuracy | 39.96 | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Natural <7>) | Mean Accuracy | 36.02 | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Specialized <4>) | Mean Accuracy | 83.04 | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Specialized <4>) | Mean Accuracy | 82.26 | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Specialized <4>) | Mean Accuracy | 69.65 | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) |
| Visual Prompt Tuning | VTAB-1k (Specialized <4>) | Mean Accuracy | 60.61 | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) |

Related Papers

- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Leveraging Language Prior for Infrared Small Target Detection (2025-07-17)
- Emotional Support with LLM-based Empathetic Dialogue Generation (2025-07-17)
- Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)