Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan

2023-03-24 | ICCV 2023 | 3D Hand Pose Estimation, Image Classification, Semantic Segmentation

Abstract

The recent amalgamation of transformer and convolutional designs has led to steady improvements in accuracy and efficiency of the models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks (image classification, detection, segmentation, and 3D mesh regression), with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models. Code and models are available at https://github.com/apple/ml-fastvit.
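The abstract's claim that structural reparameterization removes skip connections can be illustrated with a small sketch. It assumes a train-time block of the form y = x + BN(DWConv(x)) on a single channel; that form, the helper names (`depthwise_conv2d`, `train_time_block`, `reparameterize`), and the numeric values are illustrative assumptions, not code from the official repository. At inference the batch-norm statistics are constants, so the normalization folds into the kernel and bias, and the identity branch becomes a +1 at the kernel center.

```python
import math

def depthwise_conv2d(x, kernel, bias=0.0):
    """Single-channel 'same' cross-correlation with zero padding."""
    k = len(kernel)
    pad = k // 2
    h, w = len(x), len(x[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = bias
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di][dj] * x[ii][jj]
            row.append(acc)
        out.append(row)
    return out

def train_time_block(x, kernel, bn):
    """Assumed train-time form: y = x + BN(conv(x)), with a skip connection."""
    gamma, beta, mean, var, eps = bn
    scale = gamma / math.sqrt(var + eps)
    conv = depthwise_conv2d(x, kernel)
    return [[x[i][j] + (conv[i][j] - mean) * scale + beta
             for j in range(len(x[0]))] for i in range(len(x))]

def reparameterize(kernel, bn):
    """Fold BN and the identity branch into one kernel and one bias."""
    gamma, beta, mean, var, eps = bn
    scale = gamma / math.sqrt(var + eps)
    fused = [[wt * scale for wt in row] for row in kernel]
    center = len(kernel) // 2
    fused[center][center] += 1.0      # identity branch as a centered delta
    fused_bias = beta - mean * scale  # BN shift becomes the conv bias
    return fused, fused_bias

# Hypothetical inputs: a 4x4 feature map, a 3x3 kernel, fixed BN statistics.
x = [[0.5, -1.0, 2.0, 0.0],
     [1.5, 0.25, -0.75, 1.0],
     [0.0, 3.0, -2.0, 0.5],
     [1.0, -0.5, 0.75, 2.5]]
kernel = [[0.1, -0.2, 0.3],
          [0.0, 0.5, -0.1],
          [0.2, 0.1, -0.3]]
bn = (1.5, 0.1, 0.2, 0.8, 1e-5)  # gamma, beta, running mean, running var, eps

y_train = train_time_block(x, kernel, bn)
fused_kernel, fused_bias = reparameterize(kernel, bn)
y_infer = depthwise_conv2d(x, fused_kernel, fused_bias)

max_abs_diff = max(abs(a - b)
                   for row_a, row_b in zip(y_train, y_infer)
                   for a, b in zip(row_a, row_b))
print("max abs difference:", max_abs_diff)
```

The two forms agree up to floating-point rounding, but the fused form reads the input once and needs no separate addition, which is the kind of memory-access saving the abstract describes.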

Results

Task                      Dataset    Metric             Value   Model
Semantic Segmentation     ADE20K     Mean IoU (class)   44.6    FastViT-MA36
Semantic Segmentation     ADE20K     Mean IoU (class)   42.9    FastViT-SA36
Semantic Segmentation     ADE20K     Mean IoU (class)   41      FastViT-SA24
Semantic Segmentation     ADE20K     Mean IoU (class)   38      FastViT-SA12
3D Hand Pose Estimation   FreiHAND   PA-F@15mm          0.981   FastViT-MA36
3D Hand Pose Estimation   FreiHAND   PA-F@5mm           0.722   FastViT-MA36
3D Hand Pose Estimation   FreiHAND   PA-MPJPE           6.6     FastViT-MA36
3D Hand Pose Estimation   FreiHAND   PA-MPVPE           6.7     FastViT-MA36

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)