Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh

2021-06-03 · NeurIPS 2021 · Image Classification · Blocking

Abstract

Attention is sparse in vision transformers. We observe that the final prediction in vision transformers is based on only a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy that differentiably prunes a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens remain hardware friendly, which makes it easy for our framework to achieve an actual speed-up. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%, while keeping the accuracy drop within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT
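The attention-masking idea in the abstract can be sketched in a few lines: instead of physically removing a token (which would be non-differentiable), the attention logits toward that token are pushed to negative infinity, so after the softmax no other token can attend to it. Below is a minimal, hedged NumPy sketch of a single attention head; the function name, shapes, and the use of a hard 0/1 keep mask are illustrative assumptions, not the paper's actual implementation (which trains a soft, learned mask end-to-end).

```python
import numpy as np

def masked_softmax_attention(q, k, v, keep_mask):
    """Toy single-head attention where tokens with keep_mask == 0 are
    'pruned': other tokens cannot attend to them, so their influence
    on the output is blocked without changing tensor shapes.

    q, k, v: (n_tokens, d) arrays; keep_mask: (n_tokens,) of 0/1.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # (n, n) attention logits
    # Block interactions with pruned tokens: send their columns to -inf
    scores = np.where(keep_mask[None, :] > 0, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: 4 tokens of dimension 8; "prune" token 2.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
mask = np.array([1.0, 1.0, 0.0, 1.0])
out = masked_softmax_attention(q, k, v, mask)
```

Because the mask acts only on the attention logits, the forward pass keeps fixed tensor shapes, which is what makes the sparsification friendly to batched hardware execution even though the pruned token set differs per input.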

Results

Task                 | Dataset                      | Metric         | Value | Model
Image Classification | ImageNet                     | Top 1 Accuracy | 83.9  | DynamicViT-LV-M/0.8
Image Classification | ImageNet-1K (with LV-ViT-S)  | GFLOPs         | 5.8   | DynamicViT (90%)
Image Classification | ImageNet-1K (with LV-ViT-S)  | Top 1 Accuracy | 83.3  | DynamicViT (90%)
Image Classification | ImageNet-1K (with LV-ViT-S)  | GFLOPs         | 5.1   | DynamicViT (80%)
Image Classification | ImageNet-1K (with LV-ViT-S)  | Top 1 Accuracy | 83.2  | DynamicViT (80%)
Image Classification | ImageNet-1K (with LV-ViT-S)  | GFLOPs         | 4.6   | DynamicViT (70%)
Image Classification | ImageNet-1K (with LV-ViT-S)  | Top 1 Accuracy | 83.0  | DynamicViT (70%)
Image Classification | ImageNet-1K (with DeiT-S)    | GFLOPs         | 3.4   | DynamicViT (80%)
Image Classification | ImageNet-1K (with DeiT-S)    | Top 1 Accuracy | 79.8  | DynamicViT (80%)
Image Classification | ImageNet-1K (with DeiT-S)    | GFLOPs         | 4.0   | DynamicViT (90%)
Image Classification | ImageNet-1K (with DeiT-S)    | Top 1 Accuracy | 79.8  | DynamicViT (90%)
Image Classification | ImageNet-1K (with DeiT-S)    | GFLOPs         | 2.9   | DynamicViT (70%)
Image Classification | ImageNet-1K (with DeiT-S)    | Top 1 Accuracy | 79.3  | DynamicViT (70%)
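The percentages in the model names (70%, 80%, 90%) are per-stage keep ratios, and pruning is applied hierarchically, so the surviving token fraction shrinks multiplicatively across stages. A quick arithmetic check (assuming three pruning stages, which is consistent with the abstract's "66% of the input tokens pruned" figure for the 70% setting) ties the two numbers together:

```python
def tokens_kept(keep_ratio, n_stages=3):
    """Fraction of input tokens surviving hierarchical pruning,
    assuming the same keep ratio is applied at each pruning stage
    (n_stages = 3 is an assumption consistent with the abstract)."""
    return keep_ratio ** n_stages

# Keep ratio 0.7 applied over 3 stages leaves ~34% of tokens,
# i.e. ~66% are pruned, matching the abstract's figure.
pruned = 1 - tokens_kept(0.7)
```

This also explains why the 70% models are the cheapest in GFLOPs in the table above while giving up only a few tenths of a point of top-1 accuracy.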

Related Papers

- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)
- Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks (2025-07-14)
- FedGSCA: Medical Federated Learning with Global Sample Selector and Client Adaptive Adjuster under Label Noise (2025-07-13)