Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman

2022-10-17

Abstract

We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.
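
To make the idea of "combining similar tokens" concrete, here is a minimal pure-Python sketch of one merge step. It is not the authors' implementation. It assumes tokens are split alternately into two sets A and B, each A token is matched to its most similar B token by cosine similarity, and the `r` best-matched A tokens are averaged into their partners. The names `cosine` and `merge_tokens` are hypothetical, and the sketch ignores details a real implementation would need, such as protecting a class token, weighting the average by how many tokens were already merged, and preserving token order.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, r):
    """Merge the r most similar (A, B) token pairs by averaging.

    tokens: list of equal-length feature vectors (lists of floats).
    Returns a new list with len(tokens) - r vectors (r is clamped to len(A)).
    """
    # Alternate tokens into two sets so an A token can only merge into a B token.
    a, b = tokens[0::2], tokens[1::2]
    r = max(0, min(r, len(a)))
    if r == 0 or not b:
        return [list(t) for t in tokens]

    # For each A token, record (similarity, index in A, index of best match in B).
    matches = []
    for i, u in enumerate(a):
        j = max(range(len(b)), key=lambda k: cosine(u, b[k]))
        matches.append((cosine(u, b[j]), i, j))

    # The r best-matched A tokens are merged; the remaining A tokens are kept as-is.
    matches.sort(key=lambda m: m[0], reverse=True)
    merged = matches[:r]
    kept = sorted(i for _, i, _ in matches[r:])

    # Unweighted mean of each B token with every A token merged into it.
    sums = [list(v) for v in b]
    counts = [1] * len(b)
    for _, i, j in merged:
        sums[j] = [s + p for s, p in zip(sums[j], a[i])]
        counts[j] += 1
    b_out = [[s / c for s in vec] for vec, c in zip(sums, counts)]

    # Output order is: kept A tokens, then all B tokens (original order is not preserved).
    return [list(a[i]) for i in kept] + b_out
```

Calling `merge_tokens(tokens, r)` once per transformer block would shrink the sequence by `r` tokens each time, which is the sense in which the merging is gradual.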

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Classification | ImageNet-1K (with DeiT-S) | GFLOPs | 3.4 | ToMe ($r=8$) |
| Image Classification | ImageNet-1K (with DeiT-S) | Top-1 Accuracy (%) | 79.7 | ToMe ($r=8$) |
| Image Classification | ImageNet-1K (with DeiT-S) | GFLOPs | 2.7 | ToMe ($r=13$) |
| Image Classification | ImageNet-1K (with DeiT-S) | Top-1 Accuracy (%) | 79.4 | ToMe ($r=13$) |
| Image Classification | ImageNet-1K (with DeiT-S) | GFLOPs | 2.3 | ToMe ($r=16$) |
| Image Classification | ImageNet-1K (with DeiT-S) | Top-1 Accuracy (%) | 79.1 | ToMe ($r=16$) |
| Image Classification | ImageNet-1K (with DeiT-T) | GFLOPs | 0.9 | ToMe ($r=8$) |
| Image Classification | ImageNet-1K (with DeiT-T) | Top-1 Accuracy (%) | 71.7 | ToMe ($r=8$) |
| Image Classification | ImageNet-1K (with DeiT-T) | GFLOPs | 0.8 | ToMe ($r=12$) |
| Image Classification | ImageNet-1K (with DeiT-T) | Top-1 Accuracy (%) | 71.4 | ToMe ($r=12$) |
| Image Classification | ImageNet-1K (with DeiT-T) | GFLOPs | 0.6 | ToMe ($r=16$) |
| Image Classification | ImageNet-1K (with DeiT-T) | Top-1 Accuracy (%) | 70.7 | ToMe ($r=16$) |
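
The table does not define $r$. In the paper it is the number of tokens merged in each transformer layer, so a larger $r$ leaves fewer tokens in later layers and lowers GFLOPs. The sketch below illustrates that relationship under stated assumptions: a constant $r$ per layer, 12 layers, and 197 input tokens (196 patches from a 224x224 image with 16x16 patches, plus one class token). The helper `token_schedule` is hypothetical, and it omits any per-layer cap on how many tokens may be merged, so the counts are illustrative rather than exact for a given implementation.

```python
def token_schedule(num_tokens, num_layers, r):
    """Return token counts: the input count, then the count after each layer,
    assuming up to r tokens are merged away per layer."""
    counts = [num_tokens]
    for _ in range(num_layers):
        # Never drop below a single remaining token.
        counts.append(max(counts[-1] - r, 1))
    return counts


# Assumed setup: 197 input tokens, 12 layers (not stated in the table itself).
for r in (8, 13, 16):
    schedule = token_schedule(197, 12, r)
    print(f"r={r}: tokens after last layer = {schedule[-1]}")
```

Under these assumptions the final layer outputs 101 tokens for $r=8$, 41 for $r=13$, and 5 for $r=16$.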