Masked Feature Prediction for Self-Supervised Visual Pre-Training

Chen Wei, Haoqi Fan, Saining Xie, Chao-yuan Wu, Alan Yuille, Christoph Feichtenhofer

Published: 2021-12-16. Venue: CVPR 2022.

Tasks: Self-Supervised Image Classification, Action Classification, Prediction, Action Recognition

Abstract

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 39.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.
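The mask-then-predict objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it works on a single grayscale image rather than a video clip, uses a simplified per-cell HOG with per-cell L2 normalization as a stand-in for local contrast normalization, and the function names (`hog_targets`, `random_mask`, `masked_feature_loss`), the cell size of 8, the 9 orientation bins, and the mask ratio of 0.4 are all illustrative assumptions. The "prediction" is a placeholder array where a real model's output would go.

```python
import numpy as np

def hog_targets(image, cell=8, bins=9, eps=1e-6):
    """Simplified per-cell HOG for a 2-D grayscale image.

    Assumes image height and width are multiples of `cell`.
    Returns an array of shape (H // cell, W // cell, bins).
    """
    gy, gx = np.gradient(image.astype(np.float64))
    mag = np.hypot(gx, gy)
    # Unsigned gradient orientation in [0, pi).
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)

    h, w = image.shape
    gh, gw = h // cell, w // cell
    hist = np.zeros((gh, gw, bins))
    for b in range(bins):
        # Magnitude-weighted votes for orientation bin b, summed per cell.
        weighted = np.where(bin_idx == b, mag, 0.0)
        hist[:, :, b] = weighted.reshape(gh, cell, gw, cell).sum(axis=(1, 3))

    # Per-cell L2 normalization (assumed stand-in for local contrast normalization).
    norm = np.linalg.norm(hist, axis=-1, keepdims=True)
    return hist / (norm + eps)

def random_mask(grid_shape, ratio, rng):
    """Boolean mask over the cell grid with round(ratio * n) cells set to True."""
    n = grid_shape[0] * grid_shape[1]
    k = int(round(ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    return mask.reshape(grid_shape)

def masked_feature_loss(pred, target, mask):
    """Mean squared error computed over masked cells only."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
image = rng.random((32, 32))            # toy input, not real data
target = hog_targets(image)             # HOG features per 8x8 cell
mask = random_mask(target.shape[:2], ratio=0.4, rng=rng)
pred = np.zeros_like(target)            # placeholder for a model's prediction
loss = masked_feature_loss(pred, target, mask)
```

Only the masked cells contribute to the loss, so a model trained this way has to infer the features of hidden regions from the visible context.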

Results

Task	Dataset	Metric	Value	Model
Video	Kinetics-700	Top-1 Accuracy (%)	80.4	MaskFeat (no extra data, MViT-L)
Video	Kinetics-700	Top-5 Accuracy (%)	95.7	MaskFeat (no extra data, MViT-L)
Video	Kinetics-400	Top-1 Accuracy (%)	87	MaskFeat (Kinetics-600 pretrain, MViT-L)
Video	Kinetics-400	Top-5 Accuracy (%)	97.4	MaskFeat (Kinetics-600 pretrain, MViT-L)
Video	Kinetics-400	Top-1 Accuracy (%)	86.7	MaskFeat (no extra data, MViT-L)
Video	Kinetics-400	Top-5 Accuracy (%)	97.3	MaskFeat (no extra data, MViT-L)
Video	Kinetics-600	Top-1 Accuracy (%)	88.3	MaskFeat (no extra data, MViT-L)
Video	Kinetics-600	Top-5 Accuracy (%)	98	MaskFeat (no extra data, MViT-L)
Activity Recognition	Something-Something V2	Parameters (M)	218	MaskFeat (Kinetics-600 pretrain, MViT-L)
Activity Recognition	Something-Something V2	Top-1 Accuracy (%)	75	MaskFeat (Kinetics-600 pretrain, MViT-L)
Activity Recognition	Something-Something V2	Top-5 Accuracy (%)	95	MaskFeat (Kinetics-600 pretrain, MViT-L)
Activity Recognition	AVA v2.2	mAP	39.8	MaskFeat (Kinetics-600 pretrain, MViT-L)
Action Recognition	Something-Something V2	Parameters (M)	218	MaskFeat (Kinetics-600 pretrain, MViT-L)
Action Recognition	Something-Something V2	Top-1 Accuracy (%)	75	MaskFeat (Kinetics-600 pretrain, MViT-L)
Action Recognition	Something-Something V2	Top-5 Accuracy (%)	95	MaskFeat (Kinetics-600 pretrain, MViT-L)
Action Recognition	AVA v2.2	mAP	39.8	MaskFeat (Kinetics-600 pretrain, MViT-L)
