Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollár, Laurens van der Maaten
Model pre-training is a cornerstone of modern visual recognition systems. Although fully supervised pre-training on datasets like ImageNet is still the de facto standard, recent studies suggest that large-scale weakly supervised pre-training can outperform fully supervised approaches. This paper revisits weakly supervised pre-training of models using hashtag supervision with modern versions of residual networks and the largest-ever dataset of images and corresponding hashtags. We study the performance of the resulting models in various transfer-learning settings, including zero-shot transfer. We also compare our models with those obtained via large-scale self-supervised learning. We find our weakly supervised models to be very competitive across all settings, and find that they substantially outperform their self-supervised counterparts. We also investigate whether our models learned potentially troubling associations or stereotypes. Overall, our results provide a compelling argument for the use of weakly supervised learning in the development of visual recognition systems. Our models, Supervised Weakly through hashtAGs (SWAG), are publicly available.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Classification | ImageNet V2 | Top-1 Accuracy | 81.1 | SWAG (ViT H/14) |
| Image Classification | Places365-Standard | Top-1 Accuracy | 60.7 | SWAG (ViT H/14) |
| Image Classification | ObjectNet | Top-1 Accuracy | 69.5 | SWAG (ViT H/14) |
| Image Classification | ObjectNet | Top-1 Accuracy | 64.3 | RegNetY 128GF (Platt) |
| Image Classification | ObjectNet | Top-1 Accuracy | 60 | ViT H/14 (Platt) |
| Image Classification | ObjectNet | Top-1 Accuracy | 57.3 | ViT L/16 (Platt) |
| Image Classification | ObjectNet | Top-1 Accuracy | 48.9 | ViT B/16 |
| Image Classification | ImageNet | GFLOPs | 1018.8 | SWAG (ViT H/14) |
| Image Classification | CUB-200-2011 | Accuracy | 91.7 | SWAG (ViT H/14) |
| Fine-Grained Image Classification | CUB-200-2011 | Accuracy | 91.7 | SWAG (ViT H/14) |
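
Most rows in the table above report top-1 accuracy, the percentage of images whose highest-scoring predicted class matches the ground-truth label. The sketch below shows one way to compute that metric from per-class scores. The function name `top1_accuracy` and the toy inputs are illustrative assumptions; this is not the evaluation code behind the reported numbers.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return top-1 accuracy in percent.

    scores: one row of per-class scores per image (hypothetical model outputs).
    labels: the ground-truth class index for each image.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    correct = 0
    for row, label in zip(scores, labels):
        # The predicted class is the index of the largest score in the row.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example with four images and two classes (made-up scores).
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 0]
print(top1_accuracy(toy_scores, toy_labels))
```

In the toy example, three of the four predictions match their labels, so the function returns 75.0.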