FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan

2023-03-24 | ICCV 2023 | 3D Hand Pose Estimation, Image Classification, Semantic Segmentation

Abstract

The recent amalgamation of transformer and convolutional designs has led to steady improvements in the accuracy and efficiency of models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains a state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparameterization and large kernel convolutions to boost accuracy, and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks (image classification, detection, segmentation, and 3D mesh regression) with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models. Code and models are available at https://github.com/apple/ml-fastvit.
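
The idea of removing a skip-connection by structural reparameterization can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it is a simplified illustration that assumes a single depthwise 3x3 convolution with a skip branch and omits batch normalization and train-time overparameterization. The function names `depthwise_conv3x3` and `fold_skip_into_kernel` are hypothetical. It shows that an identity branch plus a depthwise convolution can be merged into one depthwise convolution by adding 1 to the centre tap of each kernel, so inference needs one branch instead of two.

```python
import numpy as np


def depthwise_conv3x3(x, w):
    """Depthwise 3x3 cross-correlation with zero padding.

    x: (C, H, W) input; w: (C, 3, 3), one kernel per channel.
    """
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            # Each kernel tap scales a shifted view of the padded input.
            out += w[:, i, j][:, None, None] * xp[:, i:i + h, j:j + wd]
    return out


def fold_skip_into_kernel(w):
    """Return a kernel equivalent to `identity + depthwise conv with w`."""
    w_rep = w.copy()
    w_rep[:, 1, 1] += 1.0  # a centre tap of 1 reproduces the skip branch
    return w_rep


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w = rng.standard_normal((4, 3, 3))

# Train-time form: two branches (skip + convolution).
y_train = x + depthwise_conv3x3(x, w)
# Inference-time form: a single convolution with the folded kernel.
y_infer = depthwise_conv3x3(x, fold_skip_into_kernel(w))

max_abs_diff = float(np.abs(y_train - y_infer).max())
```

In this sketch the two outputs agree up to floating-point rounding, so the skip branch (and the extra memory read of `x` it implies) can be dropped at inference without changing the result.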

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	ADE20K	Mean IoU (class)	44.6	FastViT-MA36
Semantic Segmentation	ADE20K	Mean IoU (class)	42.9	FastViT-SA36
Semantic Segmentation	ADE20K	Mean IoU (class)	41.0	FastViT-SA24
Semantic Segmentation	ADE20K	Mean IoU (class)	38.0	FastViT-SA12
3D Hand Pose Estimation	FreiHAND	PA-F@15mm	0.981	FastViT-MA36
3D Hand Pose Estimation	FreiHAND	PA-F@5mm	0.722	FastViT-MA36
3D Hand Pose Estimation	FreiHAND	PA-MPJPE	6.6	FastViT-MA36
3D Hand Pose Estimation	FreiHAND	PA-MPVPE	6.7	FastViT-MA36
