Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


All Tokens Matter: Token Labeling for Training Better Vision Transformers

Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, Jiashi Feng

Published 2021-04-22 · NeurIPS 2021
Tasks: Image Classification · Semantic Segmentation · General Classification
Links: Paper · PDF · Code (official)

Abstract

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, delivering the minimal-sized model among previous models (250M+) reaching 86%. We also show that token labeling can clearly improve the generalization capability of the pre-trained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.
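The objective described in the abstract combines the usual class-token classification loss with a dense, token-level loss against soft location-specific labels from a machine annotator. A minimal NumPy sketch of that combination follows; the function names and the weighting factor `beta` are illustrative assumptions, not the paper's exact formulation — see the official repository linked above for the real implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_cross_entropy(logits, soft_targets):
    # -sum(target * log_softmax(logits)) over the class axis
    log_p = logits - logits.max(axis=-1, keepdims=True)
    log_p = log_p - np.log(np.exp(log_p).sum(axis=-1, keepdims=True))
    return -(soft_targets * log_p).sum(axis=-1)

def token_labeling_loss(cls_logits, patch_logits, image_label,
                        token_labels, beta=0.5):
    """Sketch of the token-labeling objective (names are hypothetical).

    cls_logits:   (B, C)    scores from the trainable class token
    patch_logits: (B, N, C) scores from the N image patch tokens
    image_label:  (B, C)    one-hot image-level label
    token_labels: (B, N, C) soft, location-specific labels produced
                            offline by a machine annotator
    beta:         weight of the dense token-level term (assumed value)
    """
    cls_loss = soft_cross_entropy(cls_logits, image_label).mean()
    # dense term: one recognition problem per patch token
    token_loss = soft_cross_entropy(patch_logits, token_labels).mean()
    return cls_loss + beta * token_loss
```

The key difference from standard ViT training is the second term: every patch token receives its own supervision signal instead of the loss being computed on the class token alone.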

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | ADE20K | Params (M) | 209 | LV-ViT-L (UperNet, MS)
Semantic Segmentation | ADE20K | Validation mIoU | 51.8 | LV-ViT-L (UperNet, MS)
Image Classification | ImageNet | GFLOPs | 214.8 | LV-ViT-L
Image Classification | ImageNet | GFLOPs | 16 | LV-ViT-M
Image Classification | ImageNet | GFLOPs | 6.6 | LV-ViT-S
Image Classification | ImageNet-1K (with LV-ViT-S) | GFLOPs | 6.6 | Base (LV-ViT-S)
Image Classification | ImageNet-1K (with LV-ViT-S) | Top 1 Accuracy | 83.3 | Base (LV-ViT-S)

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)