Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Masked Feature Prediction for Self-Supervised Visual Pre-Training

Chen Wei, Haoqi Fan, Saining Xie, Chao-yuan Wu, Alan Yuille, Christoph Feichtenhofer

Published: 2021-12-16
Venue: CVPR 2022
Tasks: Self-Supervised Image Classification, Action Classification, Prediction, Action Recognition

Abstract

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 39.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
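The mask-then-predict objective described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names (`hog_per_patch`, `random_mask`, `masked_feature_loss`) are hypothetical, the HOG is a simplified single-frame version with one orientation histogram per patch, per-patch L2 normalization stands in for the local contrast normalization the abstract mentions, and the patch size, bin count, and masking ratio are arbitrary values not taken from this page.

```python
import numpy as np

def hog_per_patch(image, patch=8, bins=9):
    """Simplified HOG: one L2-normalized orientation histogram per patch.

    Assumption: this is an illustrative stand-in, not the paper's exact HOG setup.
    """
    gy, gx = np.gradient(image.astype(np.float64))   # derivatives along rows, columns
    mag = np.hypot(gx, gy)                           # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)

    h, w = image.shape
    rows, cols = h // patch, w // patch
    feats = np.zeros((rows * cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * patch, (r + 1) * patch),
                  slice(c * patch, (c + 1) * patch))
            # Magnitude-weighted vote of each pixel into its orientation bin.
            feats[r * cols + c] = np.bincount(
                bin_idx[sl].ravel(), weights=mag[sl].ravel(), minlength=bins)

    # Per-patch L2 normalization (stand-in for local contrast normalization).
    norm = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / (norm + 1e-6)

def random_mask(num_patches, ratio, rng):
    """Boolean mask with round(ratio * num_patches) patches set to True."""
    mask = np.zeros(num_patches, dtype=bool)
    chosen = rng.choice(num_patches, size=int(round(ratio * num_patches)),
                        replace=False)
    mask[chosen] = True
    return mask

def masked_feature_loss(pred, target, mask):
    """Mean squared error computed on the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
image = rng.random((32, 32))             # toy single-frame grayscale input
target = hog_per_patch(image)            # regression targets, one row per patch
mask = random_mask(len(target), ratio=0.4, rng=rng)
pred = np.zeros_like(target)             # placeholder for a model's predictions
loss = masked_feature_loss(pred, target, mask)
```

In an actual training setup, `pred` would come from a model that sees the input with the masked patches replaced, and the loss would be backpropagated through that model; the zero array above only marks where the prediction would go.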

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-700 | Top-1 Accuracy | 80.4 | MaskFeat (no extra data, MViT-L) |
| Video | Kinetics-700 | Top-5 Accuracy | 95.7 | MaskFeat (no extra data, MViT-L) |
| Video | Kinetics-400 | Acc@1 | 87 | MaskFeat (K600, MViT-L) |
| Video | Kinetics-400 | Acc@5 | 97.4 | MaskFeat (K600, MViT-L) |
| Video | Kinetics-400 | Acc@1 | 86.7 | MaskFeat (no extra data, MViT-L) |
| Video | Kinetics-400 | Acc@5 | 97.3 | MaskFeat (no extra data, MViT-L) |
| Video | Kinetics-600 | Top-1 Accuracy | 88.3 | MaskFeat (no extra data, MViT-L) |
| Video | Kinetics-600 | Top-5 Accuracy | 98 | MaskFeat (no extra data, MViT-L) |
| Activity Recognition | Something-Something V2 | Parameters | 218 | MaskFeat (Kinetics600 pretrain, MViT-L) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 75 | MaskFeat (Kinetics600 pretrain, MViT-L) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 95 | MaskFeat (Kinetics600 pretrain, MViT-L) |
| Activity Recognition | AVA v2.2 | mAP | 39.8 | MaskFeat (Kinetics-600 pretrain, MViT-L) |
| Action Recognition | Something-Something V2 | Parameters | 218 | MaskFeat (Kinetics600 pretrain, MViT-L) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 75 | MaskFeat (Kinetics600 pretrain, MViT-L) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 95 | MaskFeat (Kinetics600 pretrain, MViT-L) |
| Action Recognition | AVA v2.2 | mAP | 39.8 | MaskFeat (Kinetics-600 pretrain, MViT-L) |
