Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Discrete Representations Strengthen Vision Transformer Robustness

Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa

2021-11-20 | ICLR 2022 | Image Classification, Domain Generalization

Abstract

Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet are overly reliant on local textures and fail to make adequate use of shape information. ViTs thus have difficulty generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer: adding discrete tokens produced by a vector-quantized encoder. Unlike the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and individually contain less information, which encourages ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining performance on ImageNet.
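
The idea in the abstract can be illustrated with a toy sketch: each continuous patch feature is mapped to the index of its nearest codebook entry (vector quantization), and the embedding of that discrete token is combined with the continuous token. Everything below is hypothetical and chosen for illustration only: the function names, the tiny hand-written codebook, the one-hot code embeddings, and the choice to fuse by concatenation are assumptions of this sketch, not the authors' implementation.

```python
# Illustrative sketch only: toy codebook, toy embeddings, hypothetical names.

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))

def tokenize(patches, codebook):
    """Map each continuous patch feature to a discrete token id."""
    return [nearest_code(p, codebook) for p in patches]

def fuse(patches, codebook, code_embeddings):
    """Concatenate each continuous patch feature with the embedding of its
    discrete token (concatenation is an assumption of this sketch)."""
    ids = tokenize(patches, codebook)
    return [list(p) + list(code_embeddings[i]) for p, i in zip(patches, ids)]

# Toy data: 3 codebook entries in 2-D, one-hot embeddings per code.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
code_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [0.1, 0.8]]

# A small perturbation of every patch feature.
perturbed = [[x + 0.05 for x in p] for p in patches]

clean_ids = tokenize(patches, codebook)
noisy_ids = tokenize(perturbed, codebook)
fused = fuse(patches, codebook, code_embeddings)

print(clean_ids, noisy_ids)  # the discrete ids match despite the perturbation
```

In this toy setting the continuous features change under the perturbation while the discrete token ids do not, which is the invariance property the abstract refers to. A real system would use a learned vector-quantized encoder and learned embeddings rather than the hand-written values above.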

Results

Task                   Dataset            Metric                       Value  Model
Domain Adaptation      Stylized-ImageNet  Top-1 Accuracy               22.19  DiscreteViT
Domain Adaptation      ImageNet-R         Top-1 Error Rate             44.74  DiscreteViT
Domain Adaptation      ImageNet-C         mean Corruption Error (mCE)  38.74  DiscreteViT (Im21k)
Domain Adaptation      ImageNet-C         mean Corruption Error (mCE)  46.22  DrViT
Domain Adaptation      ImageNet-C         mean Corruption Error (mCE)  46.22  DiscreteViT
Domain Adaptation      ImageNet-Sketch    Top-1 Accuracy               44.72  DrViT
Image Classification   ObjectNet          Top-1 Accuracy               46.62  ViT-B (Discrete 512x512)
Domain Generalization  Stylized-ImageNet  Top-1 Accuracy               22.19  DiscreteViT
Domain Generalization  ImageNet-R         Top-1 Error Rate             44.74  DiscreteViT
Domain Generalization  ImageNet-C         mean Corruption Error (mCE)  38.74  DiscreteViT (Im21k)
Domain Generalization  ImageNet-C         mean Corruption Error (mCE)  46.22  DrViT
Domain Generalization  ImageNet-C         mean Corruption Error (mCE)  46.22  DiscreteViT
Domain Generalization  ImageNet-Sketch    Top-1 Accuracy               44.72  DrViT
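
For readers unfamiliar with the ImageNet-C metric in the results, the sketch below shows how mean Corruption Error is commonly computed for that benchmark: for each corruption type, the model's error rates summed over severity levels are divided by the same sum for a baseline model (AlexNet in the benchmark's convention), and the resulting ratios are averaged and reported as a percentage. Lower is better. The function name and all numbers are made up for illustration and are unrelated to the values reported on this page.

```python
def mean_corruption_error(model_err, baseline_err):
    """Compute mCE (in percent).

    model_err / baseline_err: dict mapping corruption name to a list of
    top-1 error rates (fractions in [0, 1]), one entry per severity level.
    """
    ces = []
    for corruption, errs in model_err.items():
        # Corruption Error: model errors normalized by the baseline's errors.
        ce = sum(errs) / sum(baseline_err[corruption])
        ces.append(100.0 * ce)
    # mCE is the plain average over corruption types.
    return sum(ces) / len(ces)

# Hypothetical error rates: two corruption types, two severity levels each.
model = {"noise": [0.25, 0.5], "blur": [0.125, 0.25]}
baseline = {"noise": [0.5, 1.0], "blur": [0.25, 0.5]}

mce = mean_corruption_error(model, baseline)
print(mce)
```

Because the score is normalized by a baseline, an mCE below 100 means the model makes fewer errors under corruption than the baseline does.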

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling (2025-07-17)