Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Zhan Tong, Yibing Song, Jue Wang, Limin Wang

Published: 2022-03-23
Tasks: Self-Supervised Action Recognition Linear, Action Classification, 4k, Video Reconstruction, Video Understanding, Action Recognition, Self-Supervised Action Recognition
Links: Paper, PDF, Code (official and community implementations)

Abstract

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
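The tube-masking idea described in the abstract can be sketched in a few lines: sample one spatial mask at a very high ratio (e.g. 90%) and repeat it along the temporal axis, so each masked patch position is hidden in every frame and forms a "tube". This is a minimal illustrative sketch, not the authors' implementation; the function name, NumPy-based shapes, and the 8x14x14 token grid (16 frames with temporal patch size 2, 224x224 input with 16x16 patches) are assumptions for the example.

```python
import numpy as np

def tube_mask(t, h, w, mask_ratio=0.9, seed=None):
    """Sample a tube mask of shape (t, h, w), True = masked.

    The same spatial patch positions are masked in every temporal
    slice, so masked patches form tubes along the time axis. This
    prevents the model from trivially copying a patch from a
    neighboring frame where it happens to be visible.
    """
    rng = np.random.default_rng(seed)
    num_spatial = h * w
    num_masked = int(round(mask_ratio * num_spatial))
    spatial = np.zeros(num_spatial, dtype=bool)
    # Choose which spatial positions to hide, without replacement.
    spatial[rng.choice(num_spatial, num_masked, replace=False)] = True
    # Repeat the 2D pattern along time so masked patches form tubes.
    return np.tile(spatial, (t, 1)).reshape(t, h, w)

# Example: 8 temporal token slices over a 14x14 spatial token grid.
mask = tube_mask(8, 14, 14, mask_ratio=0.9, seed=0)
```

At a 90% ratio on a 14x14 grid, 176 of 196 patch positions are hidden in every slice, leaving only 20 visible tokens per slice for the encoder, which is what makes the reconstruction task hard and the pre-training cheap.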

Results

(The source page lists identical result rows under both the "Activity Recognition" and "Action Recognition" tasks; they are merged below. Parameter counts are in millions.)

Task | Dataset | Metric | Value | Model
Video | Kinetics-400 | Acc@1 | 87.4 | VideoMAE (no extra data, ViT-H, 32x320x320)
Video | Kinetics-400 | Acc@5 | 97.6 | VideoMAE (no extra data, ViT-H, 32x320x320)
Video | Kinetics-400 | Acc@1 | 86.6 | VideoMAE (no extra data, ViT-H)
Video | Kinetics-400 | Acc@5 | 97.1 | VideoMAE (no extra data, ViT-H)
Video | Kinetics-400 | Acc@1 | 86.1 | VideoMAE (no extra data, ViT-L, 32x320x320)
Video | Kinetics-400 | Acc@5 | 97.3 | VideoMAE (no extra data, ViT-L, 32x320x320)
Video | Kinetics-400 | Acc@1 | 85.2 | VideoMAE (no extra data, ViT-L, 16x4)
Video | Kinetics-400 | Acc@5 | 96.8 | VideoMAE (no extra data, ViT-L, 16x4)
Video | Kinetics-400 | Acc@1 | 81.5 | VideoMAE (no extra data, ViT-B, 16x4)
Video | Kinetics-400 | Acc@5 | 95.1 | VideoMAE (no extra data, ViT-B, 16x4)
Activity / Action Recognition | Something-Something V2 | Parameters (M) | 305 | VideoMAE (no extra data, ViT-L, 32x2)
Activity / Action Recognition | Something-Something V2 | Top-1 Accuracy | 75.4 | VideoMAE (no extra data, ViT-L, 32x2)
Activity / Action Recognition | Something-Something V2 | Top-5 Accuracy | 95.2 | VideoMAE (no extra data, ViT-L, 32x2)
Activity / Action Recognition | Something-Something V2 | Parameters (M) | 305 | VideoMAE (no extra data, ViT-L, 16-frame)
Activity / Action Recognition | Something-Something V2 | Top-1 Accuracy | 74.3 | VideoMAE (no extra data, ViT-L, 16-frame)
Activity / Action Recognition | Something-Something V2 | Top-5 Accuracy | 94.6 | VideoMAE (no extra data, ViT-L, 16-frame)
Activity / Action Recognition | Something-Something V2 | Parameters (M) | 87 | VideoMAE (no extra data, ViT-B, 16-frame)
Activity / Action Recognition | Something-Something V2 | Top-1 Accuracy | 70.8 | VideoMAE (no extra data, ViT-B, 16-frame)
Activity / Action Recognition | Something-Something V2 | Top-5 Accuracy | 92.4 | VideoMAE (no extra data, ViT-B, 16-frame)
Activity / Action Recognition | AVA v2.2 | mAP | 39.5 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4)
Activity / Action Recognition | AVA v2.2 | mAP | 39.3 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4)
Activity / Action Recognition | AVA v2.2 | mAP | 37.8 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4)
Activity / Action Recognition | AVA v2.2 | mAP | 36.5 | VideoMAE (K400 pretrain, ViT-H, 16x4)
Activity / Action Recognition | AVA v2.2 | mAP | 36.1 | VideoMAE (K700 pretrain, ViT-L, 16x4)
Activity / Action Recognition | AVA v2.2 | mAP | 34.3 | VideoMAE (K400 pretrain, ViT-L, 16x4)
Activity / Action Recognition | AVA v2.2 | mAP | 31.8 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4)
Activity / Action Recognition | AVA v2.2 | mAP | 26.7 | VideoMAE (K400 pretrain, ViT-B, 16x4)
Activity / Action Recognition | UCF101 | 3-fold Accuracy | 96.1 | VideoMAE
Activity / Action Recognition | UCF101 | 3-fold Accuracy | 91.3 | VideoMAE (no extra data)
Activity / Action Recognition | HMDB51 | Top-1 Accuracy | 73.3 | VideoMAE
Activity / Action Recognition | HMDB51 | Top-1 Accuracy | 62.6 | VideoMAE (no extra data)

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation (2025-07-11)
4KAgent: Agentic Any Image to 4K Super-Resolution (2025-07-09)
GSVR: 2D Gaussian-based Video Representation for 800+ FPS with Hybrid Deformation Field (2025-07-08)