An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

2020-10-22 | ICLR 2021 | Semantic Segmentation

Abstract

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
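To make "sequences of image patches" concrete, here is a minimal sketch of how an image might be cut into fixed-size patches and flattened into a token sequence. The 16x16 patch size comes from the title; the 224x224 RGB input, the function name `split_into_patches`, and the nested-list image representation are assumptions chosen for illustration, not details taken from the abstract. The full model also applies a learned projection and position information to these vectors, which is omitted here.

```python
def split_into_patches(image, patch_size=16):
    """Split an H x W x C image (nested lists) into a sequence of flattened patches.

    Patches are read row by row, left to right. Each patch is flattened
    into one vector of length patch_size * patch_size * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Collect every channel value inside the current patch window.
            flat = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(flat)
    return patches


# Hypothetical 224 x 224 RGB input filled with zeros (an assumed size).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = split_into_patches(image)
print(len(tokens), len(tokens[0]))  # 196 768
```

Under these assumptions the image becomes a sequence of 14 x 14 = 196 tokens, each a vector of 16 x 16 x 3 = 768 values, which a standard Transformer can then process much as it would a sequence of word embeddings.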

Results

Task	Dataset	Metric	Value	Model
Domain Adaptation	VizWiz-Classification	Accuracy - All Images	49	ViT-16/L-224
Domain Adaptation	VizWiz-Classification	Accuracy - Clean Images	450	ViT-8/B-224
Image Classification	ObjectNet	Top-5 Accuracy	82.1	ViT-H/14
Image Classification	CIFAR-10	Percentage correct	99.5	ViT-H/14
Image Classification	CIFAR-10	Percentage correct	99.42	ViT-L/16
Image Classification	Flowers-102	Accuracy	99.68
Domain Generalization	VizWiz-Classification	Accuracy - All Images	49	ViT-16/L-224
Domain Generalization	VizWiz-Classification	Accuracy - Clean Images	450	ViT-8/B-224
