Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

Harsh Rangwani, Pradipto Mondal, Mayank Mishra, Ashish Ramayee Asokan, R. Venkatesh Babu

2024-04-03 · Image Classification · Long-tail Learning
Paper · PDF · Code · Code (official)

Abstract

Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, the input image is divided into patch tokens, which are processed through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNNs), ViT's simple architecture has no informative inductive bias (e.g., locality), so it requires a large amount of data for pre-training. Various data-efficient approaches (e.g., DeiT) have been proposed to train ViT effectively on balanced datasets, but only limited literature discusses the use of ViT for datasets with long-tailed imbalance. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. DeiT-LT introduces an efficient and effective way of distilling from a CNN via a distillation (DIST) token, using out-of-distribution images and re-weighting the distillation loss to increase the focus on tail classes. This leads to the learning of local, CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from a flat CNN teacher, which leads to low-rank, generalizable features for the DIST token across all ViT blocks. With the proposed DeiT-LT scheme, the distillation DIST token becomes an expert on the tail classes, while the classifier CLS token becomes an expert on the head classes. These experts help to effectively learn features for both the majority and minority classes using distinct sets of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from the small-scale CIFAR-10 LT to the large-scale iNaturalist-2018.
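The re-weighted distillation objective is the core of the abstract's description. Below is a minimal sketch of one training step, assuming a DeiT-style student whose forward pass returns separate CLS and DIST logits and a frozen CNN teacher; the function name, the inverse-frequency weighting, and the equal loss mixing are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a DeiT-LT-style training step (PyTorch).
# Assumptions: `student(x)` returns (cls_logits, dist_logits); `teacher` is a
# frozen CNN; inverse class-frequency weighting stands in for the paper's
# exact tail re-weighting scheme.
import torch
import torch.nn.functional as F

def deit_lt_step(student, teacher, images, ood_images, labels, class_counts):
    # CLS head: standard cross-entropy on ground-truth labels
    # (this token becomes the head-class expert).
    cls_logits, _ = student(images)
    loss_cls = F.cross_entropy(cls_logits, labels)

    # DIST head: hard-label distillation from the CNN teacher on
    # out-of-distribution (e.g., strongly augmented) images
    # (this token becomes the tail-class expert).
    with torch.no_grad():
        teacher_labels = teacher(ood_images).argmax(dim=1)
    _, dist_logits = student(ood_images)

    # Re-weight the distillation loss toward rare classes via inverse
    # class frequency (an assumed, common choice in long-tail learning).
    weights = 1.0 / class_counts.float()
    weights = weights * (len(class_counts) / weights.sum())
    loss_dist = F.cross_entropy(dist_logits, teacher_labels, weight=weights)

    return 0.5 * (loss_cls + loss_dist)
```

At inference, DeiT-style models typically average the CLS and DIST predictions, so the head-class and tail-class experts are combined in a single forward pass.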

Results

Task | Dataset | Metric | Value | Model
Image Classification | iNaturalist | Overall Accuracy | 75.1 | DeiT-LT (ours)
Image Classification | CIFAR-100-LT (ρ=50) | Error Rate | 39.5 | DeiT-LT
Image Classification | ImageNet-LT | Top-1 Accuracy | 59.1 | DeiT-LT
Image Classification | CIFAR-10-LT (ρ=50) | Error Rate | 10.2 | DeiT-LT
Image Classification | CIFAR-100-LT (ρ=100) | Error Rate | 44.4 | DeiT-LT
Image Classification | CIFAR-10-LT (ρ=100) | Error Rate | 12.5 | DeiT-LT

The same CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT results are also listed verbatim under the Few-Shot Image Classification, Generalized Few-Shot Classification, Long-tail Learning, and Generalized Few-Shot Learning task pages.
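For reading the table: the Error Rate entries are 100 minus top-1 accuracy, and the imbalance ratio ρ in the CIFAR-LT benchmarks is conventionally the ratio between the most- and least-frequent class sizes, with per-class counts decaying exponentially. A small sketch of that standard construction (an assumed convention from the long-tail literature, not the paper's code):

```python
# Per-class sample counts for a long-tailed split with imbalance ratio rho:
# class i keeps n_max * rho**(-i / (C - 1)) samples, so head/tail = rho.
def long_tail_counts(n_max, num_classes, rho):
    return [int(n_max * rho ** (-i / (num_classes - 1)))
            for i in range(num_classes)]

# CIFAR-10-LT with rho=100: from 5000 head-class images down to 50 tail-class images.
print(long_tail_counts(5000, 10, 100))  # [5000, 2997, ..., 83, 50]
```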

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)
Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks (2025-07-14)
FedGSCA: Medical Federated Learning with Global Sample Selector and Client Adaptive Adjuster under Label Noise (2025-07-13)