We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can double the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and increase the throughput of ViT-L on video by 2.2x, with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, speeding up MAE fine-tuning on video by up to 2x in practice. Training with ToMe further reduces the accuracy drop, giving 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with the state of the art on images, video, and audio.

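
To make "combining similar tokens" concrete, the following is a minimal NumPy sketch of a single merge step. The abstract does not spell out the matching algorithm, so everything here is an assumption made for illustration: the function name `merge_tokens`, the alternating split into two sets, the use of cosine similarity, and the unweighted averaging are not taken from the text above.

```python
import numpy as np

def merge_tokens(x, r):
    """Merge up to r of the most similar token pairs; x has shape (n_tokens, dim).

    Illustrative sketch only; the output does not preserve the input token order.
    """
    x = np.asarray(x, dtype=float)
    a, b = x[0::2], x[1::2]              # assumed: alternate tokens into sets a and b
    r = min(r, a.shape[0])               # cannot merge more tokens than set a holds
    if r <= 0 or b.shape[0] == 0:
        return x.copy()

    # Cosine similarity between every token in a and every token in b.
    eps = 1e-12
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    sim = a_n @ b_n.T

    best_b = sim.argmax(axis=1)          # most similar partner in b for each a token
    best_sim = sim.max(axis=1)
    order = np.argsort(-best_sim)        # a tokens sorted by similarity, highest first
    merged_idx = order[:r]               # these a tokens are merged into b
    kept_idx = np.sort(order[r:])        # these a tokens are kept unchanged

    # Average each merged a token with its partner (several may share a partner).
    sums = b.copy()
    counts = np.ones(b.shape[0])
    np.add.at(sums, best_b[merged_idx], a[merged_idx])
    np.add.at(counts, best_b[merged_idx], 1)
    b_out = sums / counts[:, None]

    return np.concatenate([a[kept_idx], b_out], axis=0)

tokens = np.random.default_rng(0).normal(size=(10, 4))
print(merge_tokens(tokens, r=3).shape)
```

Because each merged token is folded into its nearest partner rather than discarded, its content still contributes to the output, which is the intuitive difference from pruning that the abstract points to.
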
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Classification | ImageNet-1K (with DeiT-S) | GFLOPs | 3.4 | ToMe ($r=8$) |
| Image Classification | ImageNet-1K (with DeiT-S) | Top 1 Accuracy | 79.7 | ToMe ($r=8$) |
| Image Classification | ImageNet-1K (with DeiT-S) | GFLOPs | 2.7 | ToMe ($r=13$) |
| Image Classification | ImageNet-1K (with DeiT-S) | Top 1 Accuracy | 79.4 | ToMe ($r=13$) |
| Image Classification | ImageNet-1K (with DeiT-S) | GFLOPs | 2.3 | ToMe ($r=16$) |
| Image Classification | ImageNet-1K (with DeiT-S) | Top 1 Accuracy | 79.1 | ToMe ($r=16$) |
| Image Classification | ImageNet-1K (with DeiT-T) | GFLOPs | 0.9 | ToMe ($r=8$) |
| Image Classification | ImageNet-1K (with DeiT-T) | Top 1 Accuracy | 71.7 | ToMe ($r=8$) |
| Image Classification | ImageNet-1K (with DeiT-T) | GFLOPs | 0.8 | ToMe ($r=12$) |
| Image Classification | ImageNet-1K (with DeiT-T) | Top 1 Accuracy | 71.4 | ToMe ($r=12$) |
| Image Classification | ImageNet-1K (with DeiT-T) | GFLOPs | 0.6 | ToMe ($r=16$) |
| Image Classification | ImageNet-1K (with DeiT-T) | Top 1 Accuracy | 70.7 | ToMe ($r=16$) |
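
The parameter $r$ in the table is not defined in this section. Assuming it denotes the number of tokens merged away in each transformer block, the short sketch below shows how the token count would shrink with depth. The values of 197 input tokens and 12 blocks are assumptions chosen to resemble a DeiT-style model at 224x224 resolution, and the helper `token_schedule` and its cap of half the tokens per step are hypothetical.

```python
def token_schedule(n_tokens, r, n_layers):
    """Return the token count entering each layer and the count after the last one."""
    entering = []
    for _ in range(n_layers):
        entering.append(n_tokens)
        # Assumption: one merge step cannot remove more than half of the tokens.
        n_tokens -= min(r, n_tokens // 2)
    return entering, n_tokens

# Assumed DeiT-style setup: 197 input tokens, 12 layers.
for r in (0, 8, 16):
    entering, final = token_schedule(197, r, 12)
    mean_tokens = sum(entering) / len(entering)
    print(f"r={r}: mean tokens per layer = {mean_tokens:.1f}, final = {final}")
```

Under these assumptions, $r=8$ leaves 197 - 8 x 12 = 101 tokens after the final block. Fewer tokens per block is consistent with the lower GFLOPs reported for larger $r$ in the table, although the exact figures depend on details not given here.
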