ImageNet-21K Pretraining for the Masses

Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, Lihi Zelnik-Manor

2021-04-22

Image Classification, Action Recognition, Multi-Label Classification, Fine-Grained Image Classification

Abstract

ImageNet-1K serves as the primary dataset for pretraining deep learning models for computer vision tasks. The ImageNet-21K dataset, which is bigger and more diverse, is used less frequently for pretraining, mainly due to its complexity, its low accessibility, and an underestimation of its added value. This paper aims to close this gap and make high-quality, efficient pretraining on ImageNet-21K available to everyone. Via a dedicated preprocessing stage, use of the WordNet hierarchical structure, and a novel training scheme called semantic softmax, we show that various models, including small mobile-oriented models, benefit significantly from ImageNet-21K pretraining on numerous datasets and tasks. We also show that we outperform previous ImageNet-21K pretraining schemes for prominent new models like ViT and Mixer. Our proposed pretraining pipeline is efficient and accessible, and it leads to reproducible SoTA results from a publicly available dataset. The training code and pretrained models are available at: https://github.com/Alibaba-MIIL/ImageNet21K
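The abstract names semantic softmax without describing it. As a rough illustration only, the sketch below assumes the idea is to group classes by their depth in the WordNet hierarchy, apply a separate softmax within each depth level, and combine the per-level cross-entropy losses. The function names, the `level_slices` layout, the use of `None` for levels where a label has no class, and the weighted averaging are all assumptions made for this example; they are not taken from the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_softmax_loss(logits, level_slices, targets, level_weights=None):
    """Per-level cross-entropy, combined as a weighted average (hypothetical sketch).

    logits:        flat list of scores, one per class across all hierarchy levels
    level_slices:  (start, end) index range of each hierarchy level in `logits`
    targets:       per level, the index of the true class within that level,
                   or None when the label has no class at that level
    level_weights: optional per-level weights (assumed uniform if omitted)
    """
    if level_weights is None:
        level_weights = [1.0] * len(level_slices)

    weighted_loss = 0.0
    weight_sum = 0.0
    for (start, end), target, weight in zip(level_slices, targets, level_weights):
        if target is None:
            continue  # this level carries no supervision for the sample
        probs = softmax(logits[start:end])
        weighted_loss += -weight * math.log(probs[target])
        weight_sum += weight

    return weighted_loss / weight_sum if weight_sum else 0.0


# Toy hierarchy: level 0 has 2 classes, level 1 has 3 classes.
toy_logits = [2.0, 0.5, 0.1, 1.5, 0.3]
toy_slices = [(0, 2), (2, 5)]

# A label that exists at both levels, and one that stops at level 0.
loss_deep = semantic_softmax_loss(toy_logits, toy_slices, targets=[0, 1])
loss_shallow = semantic_softmax_loss(toy_logits, toy_slices, targets=[0, None])
print(loss_deep, loss_shallow)
```

Normalizing within each level, rather than across all classes at once, keeps a coarse class and its finer descendants from competing in the same softmax, which is the motivation assumed here.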

Results

Task	Dataset	Metric	Value	Model
Multi-Label Classification	MS-COCO	mAP	89.8	TResNet-L-V2 (ImageNet-21K-P pretraining, resolution 640)
Multi-Label Classification	MS-COCO	mAP	88.4	TResNet-L-V2 (ImageNet-21K-P pretraining, resolution 448)
Multi-Label Classification	PASCAL VOC 2007	mAP	93.1	ViT-B-16 (ImageNet-21K pretrained)
Image Classification	Stanford Cars	Accuracy	96.32	TResNet-L-V2
Image Classification	CIFAR-100	Percentage correct	94.2	ViT-B-16 (ImageNet-21K-P pretrain)

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)
Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks (2025-07-14)