Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao

2023-03-29 · CVPR 2023
Tasks: Action Classification · Spatio-Temporal Action Localization · Action Recognition · Action Recognition in Videos · Temporal Action Localization · Self-Supervised Action Recognition
Paper · PDF · Code (official)

Abstract

Scale is the primary factor in building a powerful foundation model that generalizes well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on one subset of video tokens and a decoder processing another subset. Although VideoMAE is already very efficient due to the high masking ratio in the encoder, masking the decoder as well further reduces the overall computational cost. This enables the efficient pre-training of billion-parameter models on video. We also use a progressive training paradigm: an initial pre-training on a diverse, multi-sourced unlabeled dataset, followed by post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners. The code and models are available at https://github.com/OpenGVLab/VideoMAEv2.
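The dual-masking split is easy to state in code. Below is a minimal sketch, assuming plain random sampling for both masks (the paper itself uses structured masks: tube masking for the encoder and a running-cell mask for the decoder) and an illustrative 50% decoder masking ratio; the function name and token count are hypothetical, not taken from the official repo.

import torch

def dual_mask_split(num_tokens: int, enc_mask_ratio: float = 0.90,
                    dec_mask_ratio: float = 0.50):
    """Split video-token indices for dual-masked pre-training (sketch).

    The encoder embeds only the small visible subset, and the decoder
    reconstructs only a sampled subset of the masked tokens instead of
    all of them, which is where the extra saving over single-masked
    VideoMAE pre-training comes from. Random sampling stands in for
    the paper's tube/running-cell masks.
    """
    perm = torch.randperm(num_tokens)
    num_visible = int(num_tokens * (1.0 - enc_mask_ratio))  # e.g. 10% of tokens
    visible_idx = perm[:num_visible]                        # encoder input
    masked_idx = perm[num_visible:]                         # hidden from the encoder
    num_decoded = int(masked_idx.numel() * (1.0 - dec_mask_ratio))
    decode_idx = masked_idx[:num_decoded]                   # reconstruction targets
    return visible_idx, decode_idx

# A 16-frame 224x224 clip with 2x16x16 cubes gives 8*14*14 = 1568 tokens:
# the encoder sees ~156 of them, and the decoder reconstructs ~706
# rather than all 1412 masked tokens.
visible, decoded = dual_mask_split(1568)
print(visible.numel(), decoded.numel())  # 156 706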

Results

Task | Dataset | Metric | Value | Model
Temporal Action Localization | FineAction | mAP | 18.24 | VideoMAE V2-g
Temporal Action Localization | FineAction | mAP@IoU=0.5 | 29.07 | VideoMAE V2-g
Temporal Action Localization | FineAction | mAP@IoU=0.75 | 17.66 | VideoMAE V2-g
Temporal Action Localization | FineAction | mAP@IoU=0.95 | 5.07 | VideoMAE V2-g
Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 69.6 | ActionFormer (VideoMAE V2-g features)
Temporal Action Localization | THUMOS’14 | mAP@IoU=0.3 | 84.0 | ActionFormer (VideoMAE V2-g features)
Temporal Action Localization | THUMOS’14 | mAP@IoU=0.4 | 79.6 | ActionFormer (VideoMAE V2-g features)
Temporal Action Localization | THUMOS’14 | mAP@IoU=0.5 | 73.0 | ActionFormer (VideoMAE V2-g features)
Temporal Action Localization | THUMOS’14 | mAP@IoU=0.6 | 63.5 | ActionFormer (VideoMAE V2-g features)
Temporal Action Localization | THUMOS’14 | mAP@IoU=0.7 | 47.7 | ActionFormer (VideoMAE V2-g features)
Action Recognition | Kinetics-400 | Acc@1 | 90.0 | VideoMAE V2-g (64x266x266)
Action Recognition | Kinetics-400 | Acc@5 | 98.4 | VideoMAE V2-g (64x266x266)
Action Recognition | Kinetics-400 | Acc@1 | 88.5 | VideoMAE V2-g
Action Recognition | Kinetics-400 | Acc@5 | 98.1 | VideoMAE V2-g
Action Recognition | Kinetics-600 | Top-1 Accuracy | 89.9 | VideoMAE V2-g (64x266x266)
Action Recognition | Kinetics-600 | Top-5 Accuracy | 98.5 | VideoMAE V2-g (64x266x266)
Action Recognition | Kinetics-600 | Top-1 Accuracy | 88.8 | VideoMAE V2-g
Action Recognition | Kinetics-600 | Top-5 Accuracy | 98.2 | VideoMAE V2-g
Action Recognition | Something-Something V1 | Top-1 Accuracy | 68.7 | VideoMAE V2-g
Action Recognition | Something-Something V1 | Top-5 Accuracy | 91.9 | VideoMAE V2-g
Action Recognition | Something-Something V2 | Top-1 Accuracy | 77.0 | VideoMAE V2-g
Action Recognition | Something-Something V2 | Top-5 Accuracy | 95.9 | VideoMAE V2-g
Action Recognition | Something-Something V2 | Parameters (M) | 1013 | VideoMAE V2-g
Action Recognition | UCF101 | 3-fold Accuracy | 99.6 | VideoMAE V2-g
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 88.7 | VideoMAE V2-g
Action Recognition | AVA v2.2 | mAP | 42.6 | VideoMAE V2-g
Action Recognition | AVA v2.2 | mAP (Val) | 18.24 | VideoMAE V2
Action Localization | AVA-Kinetics | val mAP | 42.6 | VideoMAE V2-g
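One consistency check these rows allow: the Avg mAP (0.3:0.7) on THUMOS’14 is the plain mean of the five per-threshold values (IoU 0.3 to 0.7 in steps of 0.1), as a few lines of Python confirm.

# Per-threshold mAP on THUMOS'14 from the table above (IoU 0.3, 0.4, 0.5, 0.6, 0.7).
per_iou_map = [84.0, 79.6, 73.0, 63.5, 47.7]
avg_map = sum(per_iou_map) / len(per_iou_map)
print(f"{avg_map:.1f}")  # 69.6, matching the reported Avg mAP (0.3:0.7)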

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)