Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


The effectiveness of MAE pre-pretraining for billion-scale pretraining

Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

2023-03-23 · ICCV 2023

Tasks: Image Classification, Action Classification, Video Recognition, Few-Shot Image Classification, Zero-Shot Transfer Image Classification, Video Classification, Action Recognition, Zero-Shot Learning, Object Detection

Paper · PDF · Code (official)

Abstract

This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.7%), ImageNet-ReaL (91.1%), 1-shot ImageNet-1k (63.6%), and zero-shot transfer on Food-101 (96.2%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images, and our models are available publicly.
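The abstract's three-stage recipe (MAE pre-pretraining, then weakly supervised pretraining, then finetuning) rests on MAE's random patch masking. Below is a minimal sketch of that masking step, assuming a standard 14×14 ViT patch grid and MAE's usual 75% mask ratio; the function name and stage bookkeeping are illustrative, not taken from the released code.

```python
import numpy as np

def random_patch_mask(num_patches: int, mask_ratio: float,
                      rng: np.random.Generator):
    """MAE-style random masking: keep a random subset of patch indices.

    Returns (keep_idx, mask), where mask[i] is True for masked-out patches.
    Only the kept patches are fed to the encoder; the decoder reconstructs
    the masked ones.
    """
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask

# The staged schedule from the abstract, as plain bookkeeping
# (training loops elided; labels are descriptive only):
stages = [
    ("pre-pretrain", "MAE self-supervised reconstruction"),
    ("pretrain", "weakly supervised, billions of images"),
    ("finetune", "downstream task, e.g. ImageNet-1k"),
]

rng = np.random.default_rng(0)
keep_idx, mask = random_patch_mask(num_patches=196, mask_ratio=0.75, rng=rng)
# With a 75% mask ratio, a 14x14 = 196-patch image keeps 49 visible patches.
```

With this masking, the encoder only ever sees 25% of the patches during pre-pretraining, which is what makes the stage cheap enough to run before billion-scale weakly supervised pretraining.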

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 74.4 | MAWS (ViT-L) |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 84 | MAWS (ViT-6.5B) |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 83 | MAWS (ViT-2B) |
| Image Classification | ObjectNet | Top-1 Accuracy | 77.9 | MAWS (ViT-6.5B) |
| Image Classification | ObjectNet | Top-1 Accuracy | 75.8 | MAWS (ViT-2B) |
| Image Classification | ObjectNet | Top-1 Accuracy | 72.6 | MAWS (ViT-H) |
| Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 82.6 | MAWS (ViT-6.5B) |
| Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 81.5 | MAWS (ViT-2B) |
| Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 79.8 | MAWS (ViT-H) |
| Image Classification | iNaturalist 2018 - 1-shot | Top-1 Accuracy | 35.5 | MAWS (ViT-2B) |
| Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 84.6 | MAWS (ViT-6.5B) |
| Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 83.7 | MAWS (ViT-2B) |
| Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 82.5 | MAWS (ViT-H) |
| Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 63.6 | MAWS (ViT-6.5B) |
| Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 62.1 | MAWS (ViT-2B) |
| Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 57.1 | MAWS (ViT-H) |
| Image Classification | iNaturalist 2018 - 5-shot | Top-1 Accuracy | 72.8 | MAWS (ViT-2B) |
| Image Classification | iNaturalist 2018 - 10-shot | Top-1 Accuracy | 80.3 | MAWS (ViT-2B) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 74.4 | MAWS (ViT-L) |
| Few-Shot Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 82.6 | MAWS (ViT-6.5B) |
| Few-Shot Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 81.5 | MAWS (ViT-2B) |
| Few-Shot Image Classification | ImageNet - 5-shot | Top-1 Accuracy | 79.8 | MAWS (ViT-H) |
| Few-Shot Image Classification | iNaturalist 2018 - 1-shot | Top-1 Accuracy | 35.5 | MAWS (ViT-2B) |
| Few-Shot Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 84.6 | MAWS (ViT-6.5B) |
| Few-Shot Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 83.7 | MAWS (ViT-2B) |
| Few-Shot Image Classification | ImageNet - 10-shot | Top-1 Accuracy | 82.5 | MAWS (ViT-H) |
| Few-Shot Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 63.6 | MAWS (ViT-6.5B) |
| Few-Shot Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 62.1 | MAWS (ViT-2B) |
| Few-Shot Image Classification | ImageNet - 1-shot | Top-1 Accuracy | 57.1 | MAWS (ViT-H) |
| Few-Shot Image Classification | iNaturalist 2018 - 5-shot | Top-1 Accuracy | 72.8 | MAWS (ViT-2B) |
| Few-Shot Image Classification | iNaturalist 2018 - 10-shot | Top-1 Accuracy | 80.3 | MAWS (ViT-2B) |
| Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 82.1 | MAWS (ViT-2B) |
| Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 81.1 | MAWS (ViT-H) |
| Zero-Shot Transfer Image Classification | Food-101 | Top-1 Accuracy | 96.2 | MAWS (ViT-2B) |
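Most rows above report Top-1 Accuracy. As a reference for how that metric is computed from model outputs, here is a minimal sketch (illustrative code, not tied to the paper's evaluation harness):

```python
import numpy as np

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples whose highest-scoring class matches the label.

    logits: (num_samples, num_classes) array of class scores.
    labels: (num_samples,) array of ground-truth class indices.
    """
    predictions = logits.argmax(axis=1)
    return float((predictions == labels).mean())

# Toy example: 3 samples, 2 classes; predictions are [1, 0, 1],
# labels are [1, 0, 0], so 2 of 3 are correct.
logits = np.array([[0.1, 0.9],
                   [0.8, 0.2],
                   [0.3, 0.7]])
labels = np.array([1, 0, 0])
acc = top1_accuracy(logits, labels)
```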

Related Papers

- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)