Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


The effectiveness of MAE pre-pretraining for billion-scale pretraining

Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

2023-03-23 · ICCV 2023

Tasks: Image Classification, Action Classification, Video Recognition, Few-Shot Image Classification, Zero-Shot Transfer Image Classification, Video Classification, Action Recognition, Zero-Shot Learning, Object Detection

Paper · PDF · Code (official)

Abstract

This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.7%), ImageNet-ReaL (91.1%), 1-shot ImageNet-1k (63.6%), and zero-shot transfer on Food-101 (96.2%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images, and our models are available publicly.
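The abstract's three-stage recipe (MAE pre-pretraining, then weakly supervised pretraining, then finetuning) rests on MAE's random patch masking. Below is a minimal sketch of that masking step, assuming a standard 14×14 ViT patch grid and MAE's usual 75% mask ratio; the function name and stage bookkeeping are illustrative, not taken from the released code.

```python
import numpy as np

def random_patch_mask(num_patches: int, mask_ratio: float,
                      rng: np.random.Generator):
    """MAE-style random masking: keep a random subset of patch indices.

    Returns (keep_idx, mask), where mask[i] is True for masked-out patches.
    Only the kept patches are fed to the encoder; the decoder reconstructs
    the masked ones.
    """
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask

# The staged schedule from the abstract, as plain bookkeeping
# (training loops elided; labels are descriptive only):
stages = [
    ("pre-pretrain", "MAE self-supervised reconstruction"),
    ("pretrain", "weakly supervised, billions of images"),
    ("finetune", "downstream task, e.g. ImageNet-1k"),
]

rng = np.random.default_rng(0)
keep_idx, mask = random_patch_mask(num_patches=196, mask_ratio=0.75, rng=rng)
# With a 75% mask ratio, a 14x14 = 196-patch image keeps 49 visible patches.
```

With this masking, the encoder only ever sees 25% of the patches during pre-pretraining, which is what makes the stage cheap enough to run before billion-scale weakly supervised pretraining.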

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 74.4 | MAWS (ViT-L) |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 84 | MAWS (ViT-6.5B) |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 83 | MAWS (ViT-2B) |
| Image Classification | ObjectNet | Top-1 Accuracy | 77.9 | MAWS (ViT-6.5B) |
| Image Classification | ObjectNet | Top-1 Accuracy | 75.8 | MAWS (ViT-2B) |
| Image Classification | ObjectNet | Top-1 Accuracy | 72.6 | MAWS (ViT-H) |
| Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 82.6 | MAWS (ViT-6.5B) |
| Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 81.5 | MAWS (ViT-2B) |
| Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 79.8 | MAWS (ViT-H) |
| Image Classification | iNaturalist 2018 - 1-shot | Top-1 Accuracy | 35.5 | MAWS (ViT-2B) |
| Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 84.6 | MAWS (ViT-6.5B) |
| Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 83.7 | MAWS (ViT-2B) |
| Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 82.5 | MAWS (ViT-H) |
| Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 63.6 | MAWS (ViT-6.5B) |
| Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 62.1 | MAWS (ViT-2B) |
| Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 57.1 | MAWS (ViT-H) |
| Image Classification | iNaturalist 2018 - 5-shot | Top-1 Accuracy | 72.8 | MAWS (ViT-2B) |
| Image Classification | iNaturalist 2018 - 10-shot | Top-1 Accuracy | 80.3 | MAWS (ViT-2B) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 74.4 | MAWS (ViT-L) |
| Few-Shot Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 82.6 | MAWS (ViT-6.5B) |
| Few-Shot Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 81.5 | MAWS (ViT-2B) |
| Few-Shot Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 79.8 | MAWS (ViT-H) |
| Few-Shot Image Classification | iNaturalist 2018 - 1-shot | Top-1 Accuracy | 35.5 | MAWS (ViT-2B) |
| Few-Shot Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 84.6 | MAWS (ViT-6.5B) |
| Few-Shot Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 83.7 | MAWS (ViT-2B) |
| Few-Shot Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 82.5 | MAWS (ViT-H) |
| Few-Shot Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 63.6 | MAWS (ViT-6.5B) |
| Few-Shot Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 62.1 | MAWS (ViT-2B) |
| Few-Shot Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 57.1 | MAWS (ViT-H) |
| Few-Shot Image Classification | iNaturalist 2018 - 5-shot | Top-1 Accuracy | 72.8 | MAWS (ViT-2B) |
| Few-Shot Image Classification | iNaturalist 2018 - 10-shot | Top-1 Accuracy | 80.3 | MAWS (ViT-2B) |
| Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 82.1 | MAWS (ViT-2B) |
| Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 81.1 | MAWS (ViT-H) |
| Zero-Shot Transfer Image Classification | Food-101 | Top-1 Accuracy | 96.2 | MAWS (ViT-2B) |
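Most rows above report Top-1 Accuracy. As a reference for how that metric is computed from model outputs, here is a minimal sketch (illustrative code, not tied to the paper's evaluation harness):

```python
import numpy as np

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples whose highest-scoring class matches the label.

    logits: (num_samples, num_classes) array of class scores.
    labels: (num_samples,) array of ground-truth class indices.
    """
    predictions = logits.argmax(axis=1)
    return float((predictions == labels).mean())

# Toy example: 3 samples, 2 classes; predictions are [1, 0, 1],
# labels are [1, 0, 0], so 2 of 3 are correct.
logits = np.array([[0.1, 0.9],
                   [0.8, 0.2],
                   [0.3, 0.7]])
labels = np.array([1, 0, 0])
acc = top1_accuracy(logits, labels)
```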

Related Papers

- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)