Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19-task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct a detailed analysis of the main components that lead to high transfer performance.
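The pre-train-then-transfer paradigm described above can be illustrated with a toy sketch. All names here are hypothetical: a frozen random projection stands in for the pre-trained BiT backbone, and only a freshly initialised head is trained on the target task (BiT itself fine-tunes the whole network, using the "BiT-HyperRule" heuristic to pick the schedule and resolution).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone: in BiT this is a ResNet pre-trained on
# a large supervised dataset; here a frozen random projection plays that role.
D_IN, D_FEAT, N_CLASSES = 16, 32, 3
W_backbone = rng.normal(size=(D_IN, D_FEAT)) / np.sqrt(D_IN)

def features(x):
    """Frozen feature extractor (kept fixed during transfer in this sketch)."""
    return np.maximum(x @ W_backbone, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Tiny synthetic "target task": 64 examples, 3 classes.
x = rng.normal(size=(64, D_IN))
y = rng.integers(0, N_CLASSES, size=64)

# Transfer step: attach a fresh, zero-initialised head for the new label space.
W_head = np.zeros((D_FEAT, N_CLASSES))

def cross_entropy(W):
    p = softmax(features(x) @ W)
    return float(-np.log(p[np.arange(len(y)), y]).mean())

# Plain gradient descent on the head only (BiT fine-tunes all weights with SGD).
lr = 0.1
for _ in range(200):
    f = features(x)
    grad = f.T @ (softmax(f @ W_head) - np.eye(N_CLASSES)[y]) / len(x)
    W_head -= lr * grad

print("loss before:", round(float(np.log(N_CLASSES)), 4))  # zero head => uniform predictions
print("loss after: ", round(cross_entropy(W_head), 4))
```

The point of the sketch is the structure, not the numbers: the expensive upstream training is done once, and adapting to a new task only requires swapping the head and running a short, cheap optimisation.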
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Classification | OmniBenchmark | Average Top-1 Accuracy | 40.4 | BiT-M |
| Image Classification | ObjectNet | Top-1 Accuracy | 58.7 | BiT-L (ResNet-152x4) |
| Image Classification | ObjectNet | Top-5 Accuracy | 80 | BiT-L (ResNet-152x4) |
| Image Classification | ObjectNet | Top-1 Accuracy | 47 | BiT-M (ResNet-152x4) |
| Image Classification | ObjectNet | Top-5 Accuracy | 69 | BiT-M (ResNet-152x4) |
| Image Classification | ObjectNet | Top-1 Accuracy | 36 | BiT-S (ResNet-152x4) |
| Image Classification | ObjectNet | Top-5 Accuracy | 57 | BiT-S (ResNet-152x4) |
| Image Classification | CIFAR-10 | Percentage correct | 99.37 | BiT-L (ResNet) |
| Image Classification | CIFAR-10 | Percentage correct | 98.91 | BiT-M (ResNet) |
| Image Classification | VTAB-1k | Top-1 Accuracy | 78.72 | BiT-L (50 hypers/task) |
| Image Classification | VTAB-1k | Top-1 Accuracy | 76.3 | BiT-L |
| Image Classification | VTAB-1k | Top-1 Accuracy | 70.6 | BiT-M |
| Image Classification | VTAB-1k | Top-1 Accuracy | 66.9 | BiT-S |
| Image Classification | Flowers-102 | Accuracy | 99.63 | BiT-L (ResNet) |
| Image Classification | Flowers-102 | Accuracy | 99.3 | BiT-M (ResNet) |
| Image Classification | ObjectNet (Bounding Box) | Top-5 Accuracy | 85.1 | BiT-L (ResNet) |
| Image Classification | ObjectNet (Bounding Box) | Top-5 Accuracy | 76 | BiT-M (ResNet) |
| Image Classification | ObjectNet (Bounding Box) | Top-5 Accuracy | 64.4 | BiT-S (ResNet) |
| Image Classification | CIFAR-100 | Percentage correct | 93.51 | BiT-L (ResNet) |
| Image Classification | CIFAR-100 | Percentage correct | 92.17 | BiT-M (ResNet) |
| Image Classification | ImageNet | Top-5 Accuracy | 98.46 | BiT-L (ResNet) |
| Image Classification | Oxford-IIIT Pets | Accuracy | 96.62 | BiT-L (ResNet) |
| Image Classification | Oxford-IIIT Pets | Accuracy | 94.47 | BiT-M (ResNet) |
| Image Classification | Oxford 102 Flowers | Top-1 Error Rate | 0.37 | BiT-L (ResNet) |
| Image Classification | Oxford 102 Flowers | Top-1 Error Rate | 0.7 | BiT-M (ResNet) |
| Fine-Grained Image Classification | Oxford-IIIT Pets | Accuracy | 96.62 | BiT-L (ResNet) |
| Fine-Grained Image Classification | Oxford-IIIT Pets | Accuracy | 94.47 | BiT-M (ResNet) |
| Fine-Grained Image Classification | Oxford 102 Flowers | Top-1 Error Rate | 0.37 | BiT-L (ResNet) |
| Fine-Grained Image Classification | Oxford 102 Flowers | Top-1 Error Rate | 0.7 | BiT-M (ResNet) |
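The table reports both Top-1 and Top-5 accuracy. Both are instances of top-k accuracy, the fraction of examples whose true label appears among the k highest-scoring classes. A minimal NumPy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    # Indices of the k largest logits per row; order within the top-k is irrelevant.
    topk = np.argpartition(logits, -k, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 4 samples, 5 classes.
logits = np.array([
    [0.1, 0.9, 0.0, 0.0, 0.0],    # predicts class 1
    [0.8, 0.1, 0.0, 0.0, 0.1],    # predicts class 0
    [0.2, 0.3, 0.4, 0.05, 0.05],  # predicts class 2, true label 1 is 2nd
    [0.0, 0.0, 0.0, 0.1, 0.9],    # predicts class 4
])
labels = np.array([1, 0, 1, 0])

print(topk_accuracy(logits, labels, k=1))  # 0.5
print(topk_accuracy(logits, labels, k=2))  # 0.75
```

Note also that the Oxford 102 Flowers rows report Top-1 Error Rate, which is simply 100% minus Top-1 accuracy (0.37% error corresponds to the 99.63% accuracy listed for Flowers-102).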