Papers With Code 2. Data sourced from the PWC Archive (CC-BY-SA 4.0).

When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations

Xiangning Chen, Cho-Jui Hsieh, Boqing Gong

Published 2021-06-03 · ICLR 2022
Tasks: Image Classification, Domain Generalization, Transfer Learning, Fine-Grained Image Classification

Abstract

Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features and inductive biases with general-purpose neural architectures. Existing works empower the models with massive data, such as large-scale pre-training and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rates). This paper therefore investigates ViTs and MLP-Mixers through the lens of loss geometry, aiming to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian analysis reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer (SAM), we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pre-training or strong data augmentations. Model checkpoints are available at https://github.com/google-research/vision_transformer.
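The sharpness-aware optimizer referenced in the abstract is SAM (Sharpness-Aware Minimization), which replaces each gradient step with a two-step update: first ascend to an adversarial weight perturbation within an L2 ball of radius rho, then descend using the gradient evaluated at the perturbed weights. Below is a minimal PyTorch sketch of that update, assuming a standard classification setup; the function name `sam_step` and the choice `rho=0.05` are illustrative, not the paper's actual training code.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_opt, rho=0.05):
    """One SAM update (illustrative sketch): ascend to a nearby adversarial
    weight perturbation, then descend with the gradient taken there."""
    # 1) Gradient at the current weights w.
    loss_fn(model(inputs), targets).backward()

    # 2) Perturb each parameter by rho * g / ||g|| (the epsilon-hat step).
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    base_opt.zero_grad()

    # 3) Gradient at the perturbed weights w + eps.
    loss_fn(model(inputs), targets).backward()

    # 4) Undo the perturbation, then step the base optimizer using
    #    the gradient from the perturbed point.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
```

In the results below, a model name such as ViT-B/16-SAM denotes that architecture trained with this procedure wrapped around a standard base optimizer.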

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Domain Adaptation / Domain Generalization | ImageNet-R | Top-1 Error Rate | 71.9 | ResNet-152x2-SAM |
| Domain Adaptation / Domain Generalization | ImageNet-R | Top-1 Error Rate | 73.6 | ViT-B/16-SAM |
| Domain Adaptation / Domain Generalization | ImageNet-R | Top-1 Error Rate | 76.5 | Mixer-B/8-SAM |
| Domain Adaptation / Domain Generalization | ImageNet-C | Top-1 Accuracy | 56.5 | ViT-B/16-SAM |
| Domain Adaptation / Domain Generalization | ImageNet-C | Top-1 Accuracy | 55.0 | ResNet-152x2-SAM |
| Domain Adaptation / Domain Generalization | ImageNet-C | Top-1 Accuracy | 48.9 | Mixer-B/8-SAM |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 69.6 | ResNet-152x2-SAM |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 67.5 | ViT-B/16-SAM |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 65.5 | Mixer-B/8-SAM |
| Image Classification | CIFAR-10 | Percentage correct | 98.6 | ViT-B/16-SAM |
| Image Classification | CIFAR-10 | Percentage correct | 98.2 | ResNet-152-SAM |
| Image Classification | CIFAR-10 | Percentage correct | 98.2 | ViT-S/16-SAM |
| Image Classification | CIFAR-10 | Percentage correct | 97.8 | Mixer-B/16-SAM |
| Image Classification | CIFAR-10 | Percentage correct | 97.4 | ResNet-50-SAM |
| Image Classification | CIFAR-10 | Percentage correct | 96.1 | Mixer-S/16-SAM |
| Image Classification | Flowers-102 | Accuracy | 91.8 | ViT-B/16-SAM |
| Image Classification | Flowers-102 | Accuracy | 91.5 | ViT-S/16-SAM |
| Image Classification | Flowers-102 | Accuracy | 91.1 | ResNet-152-SAM |
| Image Classification | Flowers-102 | Accuracy | 90.0 | ResNet-50-SAM |
| Image Classification | Flowers-102 | Accuracy | 90.0 | Mixer-B/16-SAM |
| Image Classification | Flowers-102 | Accuracy | 87.9 | Mixer-S/16-SAM |
| Image Classification | CIFAR-100 | Percentage correct | 89.1 | ViT-B/16-SAM |
| Image Classification | CIFAR-100 | Percentage correct | 87.6 | ViT-S/16-SAM |
| Image Classification | CIFAR-100 | Percentage correct | 86.4 | Mixer-B/16-SAM |
| Image Classification | CIFAR-100 | Percentage correct | 85.2 | ResNet-50-SAM |
| Image Classification | CIFAR-100 | Percentage correct | 82.4 | Mixer-S/16-SAM |
| Image Classification / Fine-Grained Image Classification | Oxford-IIIT Pets | Accuracy | 93.3 | ResNet-152-SAM |
| Image Classification / Fine-Grained Image Classification | Oxford-IIIT Pets | Accuracy | 93.1 | ViT-B/16-SAM |
| Image Classification / Fine-Grained Image Classification | Oxford-IIIT Pets | Accuracy | 92.9 | ViT-S/16-SAM |
| Image Classification / Fine-Grained Image Classification | Oxford-IIIT Pets | Accuracy | 92.5 | Mixer-B/16-SAM |
| Image Classification / Fine-Grained Image Classification | Oxford-IIIT Pets | Accuracy | 91.6 | ResNet-50-SAM |
| Image Classification / Fine-Grained Image Classification | Oxford-IIIT Pets | Accuracy | 88.7 | Mixer-S/16-SAM |
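The sharpness claim behind these numbers (the abstract's "Visualization and Hessian analysis reveal extremely sharp local minima") is typically quantified via the dominant eigenvalue of the training-loss Hessian. As a rough illustration of how such a measurement can be made without ever forming the full Hessian, here is a power-iteration sketch using Hessian-vector products; the function name, iteration count, and lack of convergence checks are assumptions for brevity, not the paper's evaluation code.

```python
import torch

def top_hessian_eigenvalue(model, loss, n_iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. the model
    parameters (a common sharpness proxy) by power iteration on
    Hessian-vector products. Illustrative sketch only."""
    params = [p for p in model.parameters() if p.requires_grad]
    # First-order gradients with a live graph, so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(n_iters):
        # Normalize the current direction v.
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: Hv = d(g . v)/dw.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v gives the current eigenvalue estimate.
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig
```

A large returned value indicates a sharp minimum; by the paper's thesis, SAM-trained ViTs and Mixers would be expected to yield markedly smaller values than their vanilla-trained counterparts.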
