Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Xiangcheng Liu, Tianyi Wu, Guodong Guo

2022-09-28 · Informativeness
Paper · PDF · Code (official)

Abstract

The vision transformer has emerged as a new paradigm in computer vision, delivering excellent performance at considerable computational cost. Image token pruning is one of the main approaches to ViT compression, because complexity is quadratic in the token count and many tokens containing only background regions do not truly contribute to the final prediction. Existing works either rely on additional modules to score the importance of individual tokens, or apply a fixed pruning ratio to all input instances. In this work, we propose an adaptive sparse token pruning framework with minimal overhead. Specifically, we first propose an inexpensive class-attention scoring mechanism weighted by attention-head importance. Then, learnable parameters are inserted as thresholds to distinguish informative tokens from unimportant ones. By comparing token attention scores against these thresholds, we discard useless tokens hierarchically and thus accelerate inference. The learnable thresholds are optimized with budget-aware training to balance accuracy and complexity, yielding instance-specific pruning configurations. Extensive experiments demonstrate the effectiveness of our approach. Our method improves the throughput of DeiT-S by 50% with only a 0.2% drop in top-1 accuracy, achieving a better accuracy-latency trade-off than previous methods.
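The core idea in the abstract — score each patch token by class attention weighted by per-head importance, then keep only tokens whose score clears a learnable threshold — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the normalization of head weights, and the scalar per-layer threshold are all assumptions.

```python
import numpy as np

def prune_tokens(attn, head_importance, threshold):
    """Illustrative sketch of head-importance-weighted class-attention pruning.

    attn:            (H, N) class-token attention per head over N patch tokens
    head_importance: (H,) learned per-head weights (assumed non-negative)
    threshold:       scalar learnable threshold for this layer (assumption:
                     one scalar per layer; the paper's exact parameterization
                     may differ)
    Returns a boolean keep-mask over the N tokens.
    """
    w = head_importance / head_importance.sum()  # normalize head weights
    scores = w @ attn                            # (N,) weighted class-attention score
    return scores >= threshold                   # keep informative tokens only
```

For example, with two heads and three tokens, `prune_tokens(np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]), np.array([1.0, 1.0]), 0.25)` keeps tokens 0 and 2 and discards token 1, whose weighted score (0.2) falls below the threshold. Applying this mask at successive layers gives the hierarchical discarding described above.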

Results

| Task                 | Dataset                      | Model           | GFLOPs | Top-1 Accuracy |
|----------------------|------------------------------|-----------------|--------|----------------|
| Image Classification | ImageNet-1K (with LV-ViT-S)  | AS-LV-S (70%)   | 4.6    | 83.1           |
| Image Classification | ImageNet-1K (with LV-ViT-S)  | AS-LV-S (60%)   | 3.9    | 82.6           |
| Image Classification | ImageNet-1K (with DeiT-S)    | AS-DeiT-S (65%) | 3.0    | 79.6           |
| Image Classification | ImageNet-1K (with DeiT-S)    | AS-DeiT-S (50%) | 2.3    | 78.7           |
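The keep ratios attached to each model (70%, 65%, etc.) are targets reached through the budget-aware training mentioned in the abstract. One common way to implement such a budget constraint — shown here purely as a hedged sketch, since the paper's exact loss is not given on this page — is a penalty on the deviation of the average kept-token ratio from the target:

```python
def budget_loss(kept_ratio, target_ratio, lam=1.0):
    """Illustrative budget penalty (assumption: squared deviation, weight lam).

    kept_ratio:   average fraction of tokens kept across the batch
    target_ratio: desired compute budget, e.g. 0.7 for a 70% model
    """
    return lam * (kept_ratio - target_ratio) ** 2
```

Added to the classification loss, this term pushes the learnable thresholds so that, on average, the network keeps roughly the budgeted fraction of tokens while individual inputs can still be pruned more or less aggressively.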

Related Papers

- Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation (2025-07-09)
- LumiCRS: Asymmetric Contrastive Prototype Learning for Long-Tail Conversational Movie Recommendation (2025-07-07)
- Dynamic Bandwidth Allocation for Hybrid Event-RGB Transmission (2025-06-25)
- Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment (2025-06-24)
- CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems (2025-06-09)
- Image Reconstruction as a Tool for Feature Analysis (2025-06-09)
- Investigating the Impact of Word Informativeness on Speech Emotion Recognition (2025-06-02)
- Assumption-free stability for ranking problems (2025-06-02)