LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze

2021-04-02 | ICCV 2021 10 | Image Classification | General Classification


Abstract

We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast-inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy trade-off. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT.
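The abstract names the attention bias without spelling it out, so the following is only a minimal sketch of the idea, not the released implementation. It assumes (following the description in the full paper) that the bias is a learned scalar added to each attention logit and shared by all query/key pairs with the same absolute offset (|dy|, |dx|) on the token grid. The function names `offset_index` and `biased_attention`, the single-head setup, and the zero-initialised bias table are illustrative assumptions; in a trained model the table would be learned, with one table per head.

```python
import math
import random


def offset_index(h, w):
    """For an h x w token grid, map each (query, key) pair to the index of
    its absolute offset (|dy|, |dx|). Pairs with the same offset share a bias."""
    coords = [(y, x) for y in range(h) for x in range(w)]
    return [[abs(qy - ky) * w + abs(qx - kx) for (ky, kx) in coords]
            for (qy, qx) in coords]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(q, k, v, bias_table, idx):
    """Single-head attention: logit = scaled dot product + offset-indexed bias.
    No separate positional embedding is added to the tokens."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        logits = [scale * sum(a * b for a, b in zip(qi, kj)) + bias_table[idx[i][j]]
                  for j, kj in enumerate(k)]
        weights = softmax(logits)
        out.append([sum(wj * vj[c] for wj, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


# Hypothetical toy example: a 2 x 3 grid of 4-dimensional tokens.
h, w, d = 2, 3, 4
rng = random.Random(0)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(h * w)]
idx = offset_index(h, w)
bias_table = [0.0] * (h * w)  # one entry per distinct offset; learned in practice
out = biased_attention(tokens, tokens, tokens, bias_table, idx)
```

Because the bias depends only on the relative offset, the table has h * w entries rather than (h * w) squared, and the same positional signal is injected at every attention layer instead of once at the input.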

Results

Task	Dataset	Metric	Value	Model
Image Classification	Stanford Cars	Accuracy	89.8	LeViT-192
Image Classification	Stanford Cars	Accuracy	89.3	LeViT-384
Image Classification	Stanford Cars	Accuracy	88.6	LeViT-128
Image Classification	Stanford Cars	Accuracy	88.4	LeViT-128S
Image Classification	Stanford Cars	Accuracy	88.2	LeViT-256
Image Classification	ImageNet V2	Top 1 Accuracy	71.4	LeViT-384
Image Classification	ImageNet V2	Top 1 Accuracy	69.9	LeViT-256
Image Classification	ImageNet V2	Top 1 Accuracy	68.7	LeViT-192
Image Classification	ImageNet V2	Top 1 Accuracy	67.5	LeViT-128
Image Classification	ImageNet V2	Top 1 Accuracy	63.9	LeViT-128S
Image Classification	CIFAR-10	Percentage correct	98.2	LeViT-192
Image Classification	CIFAR-10	Percentage correct	98.1	LeViT-256
Image Classification	CIFAR-10	Percentage correct	98	LeViT-384
Image Classification	CIFAR-10	Percentage correct	97.6	LeViT-128
Image Classification	CIFAR-10	Percentage correct	97.5	LeViT-128S
Image Classification	Flowers-102	Accuracy	98.3	LeViT-384
Image Classification	Flowers-102	Accuracy	97.8	LeViT-192
Image Classification	Flowers-102	Accuracy	97.7	LeViT-256
Image Classification	Flowers-102	Accuracy	96.8	LeViT-128S
Image Classification	iNaturalist 2019	Top-1 Accuracy	74.3	LeViT-384
Image Classification	iNaturalist 2019	Top-1 Accuracy	72.3	LeViT-256
Image Classification	iNaturalist 2019	Top-1 Accuracy	70.8	LeViT-192
Image Classification	iNaturalist 2019	Top-1 Accuracy	68.4	LeViT-128
Image Classification	iNaturalist 2019	Top-1 Accuracy	66.5	LeViT-128S
Image Classification	ImageNet	GFLOPs	2.334	LeViT-384
Image Classification	ImageNet	GFLOPs	1.066	LeViT-256
Image Classification	ImageNet	GFLOPs	0.624	LeViT-192
Image Classification	ImageNet	GFLOPs	0.376	LeViT-128
Image Classification	ImageNet	GFLOPs	0.288	LeViT-128S
