Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao

2023-03-29 · CVPR 2023
Tasks: Action Classification · Spatio-Temporal Action Localization · Action Recognition · Action Recognition in Videos · Temporal Action Localization · Self-Supervised Action Recognition
Paper · PDF · Code (official)

Abstract

Scale is the primary factor in building a powerful foundation model that generalizes well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on one subset of video tokens and a decoder processing another subset. Although VideoMAE is already very efficient due to the high masking ratio in the encoder, masking the decoder as well further reduces the overall computational cost. This enables the efficient pre-training of billion-parameter models on video. We also use a progressive training paradigm: an initial pre-training on a diverse, multi-sourced unlabeled dataset, followed by post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners. The code and models are available at https://github.com/OpenGVLab/VideoMAEv2.
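The dual-masking split is easy to state in code. Below is a minimal sketch, assuming plain random sampling for both masks (the paper itself uses structured masks: tube masking for the encoder and a running-cell mask for the decoder) and an illustrative 50% decoder masking ratio; the function name and token count are hypothetical, not taken from the official repo.

import torch

def dual_mask_split(num_tokens: int, enc_mask_ratio: float = 0.90,
                    dec_mask_ratio: float = 0.50):
    """Split video-token indices for dual-masked pre-training (sketch).

    The encoder embeds only the small visible subset, and the decoder
    reconstructs only a sampled subset of the masked tokens instead of
    all of them, which is where the extra saving over single-masked
    VideoMAE pre-training comes from. Random sampling stands in for
    the paper's tube/running-cell masks.
    """
    perm = torch.randperm(num_tokens)
    num_visible = int(num_tokens * (1.0 - enc_mask_ratio))  # e.g. 10% of tokens
    visible_idx = perm[:num_visible]                        # encoder input
    masked_idx = perm[num_visible:]                         # hidden from the encoder
    num_decoded = int(masked_idx.numel() * (1.0 - dec_mask_ratio))
    decode_idx = masked_idx[:num_decoded]                   # reconstruction targets
    return visible_idx, decode_idx

# A 16-frame 224x224 clip with 2x16x16 cubes gives 8*14*14 = 1568 tokens:
# the encoder sees ~156 of them, and the decoder reconstructs ~706
# rather than all 1412 masked tokens.
visible, decoded = dual_mask_split(1568)
print(visible.numel(), decoded.numel())  # 156 706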

Results

Task | Dataset | Metric | Value | Model
Temporal Action Localization | FineAction | mAP | 18.24 | VideoMAE V2-g
Temporal Action Localization | FineAction | mAP@IoU=0.5 | 29.07 | VideoMAE V2-g
Temporal Action Localization | FineAction | mAP@IoU=0.75 | 17.66 | VideoMAE V2-g
Temporal Action Localization | FineAction | mAP@IoU=0.95 | 5.07 | VideoMAE V2-g
Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 69.6 | ActionFormer (VideoMAE V2-g features)
Temporal Action Localization | THUMOS’14 | mAP@IoU=0.3 | 84.0 | ActionFormer (VideoMAE V2-g features)
Temporal Action Localization | THUMOS’14 | mAP@IoU=0.4 | 79.6 | ActionFormer (VideoMAE V2-g features)
Temporal Action Localization | THUMOS’14 | mAP@IoU=0.5 | 73.0 | ActionFormer (VideoMAE V2-g features)
Temporal Action Localization | THUMOS’14 | mAP@IoU=0.6 | 63.5 | ActionFormer (VideoMAE V2-g features)
Temporal Action Localization | THUMOS’14 | mAP@IoU=0.7 | 47.7 | ActionFormer (VideoMAE V2-g features)
Action Recognition | Kinetics-400 | Acc@1 | 90.0 | VideoMAE V2-g (64x266x266)
Action Recognition | Kinetics-400 | Acc@5 | 98.4 | VideoMAE V2-g (64x266x266)
Action Recognition | Kinetics-400 | Acc@1 | 88.5 | VideoMAE V2-g
Action Recognition | Kinetics-400 | Acc@5 | 98.1 | VideoMAE V2-g
Action Recognition | Kinetics-600 | Top-1 Accuracy | 89.9 | VideoMAE V2-g (64x266x266)
Action Recognition | Kinetics-600 | Top-5 Accuracy | 98.5 | VideoMAE V2-g (64x266x266)
Action Recognition | Kinetics-600 | Top-1 Accuracy | 88.8 | VideoMAE V2-g
Action Recognition | Kinetics-600 | Top-5 Accuracy | 98.2 | VideoMAE V2-g
Action Recognition | Something-Something V1 | Top-1 Accuracy | 68.7 | VideoMAE V2-g
Action Recognition | Something-Something V1 | Top-5 Accuracy | 91.9 | VideoMAE V2-g
Action Recognition | Something-Something V2 | Top-1 Accuracy | 77.0 | VideoMAE V2-g
Action Recognition | Something-Something V2 | Top-5 Accuracy | 95.9 | VideoMAE V2-g
Action Recognition | Something-Something V2 | Parameters (M) | 1013 | VideoMAE V2-g
Action Recognition | UCF101 | 3-fold Accuracy | 99.6 | VideoMAE V2-g
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 88.7 | VideoMAE V2-g
Action Recognition | AVA v2.2 | mAP | 42.6 | VideoMAE V2-g
Action Recognition | AVA v2.2 | mAP (Val) | 18.24 | VideoMAE V2
Action Localization | AVA-Kinetics | val mAP | 42.6 | VideoMAE V2-g
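One consistency check these rows allow: the Avg mAP (0.3:0.7) on THUMOS’14 is the plain mean of the five per-threshold values (IoU 0.3 to 0.7 in steps of 0.1), as a few lines of Python confirm.

# Per-threshold mAP on THUMOS'14 from the table above (IoU 0.3, 0.4, 0.5, 0.6, 0.7).
per_iou_map = [84.0, 79.6, 73.0, 63.5, 47.7]
avg_map = sum(per_iou_map) / len(per_iou_map)
print(f"{avg_map:.1f}")  # 69.6, matching the reported Avg mAP (0.3:0.7)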

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)