Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent ImageMAE, we propose customized video tube masking with an extremely high masking ratio. This simple design makes video reconstruction a more challenging self-supervision task and thus encourages the extraction of more effective video representations during pre-training. We obtain three important findings on SSVP: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable performance for VideoMAE. Because video content is temporally redundant, it tolerates a higher masking ratio than images do. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP, and that domain shift between the pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
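To make the tube masking idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that "tube" means one random spatial mask is sampled over the patch grid and then repeated across every temporal slice, so a masked patch location stays masked for the whole clip. The function name `tube_mask`, the 8 x 14 x 14 token grid, and the seed are illustrative assumptions; only the 90% ratio comes from the abstract.

```python
import random


def tube_mask(num_temporal, grid_h, grid_w, mask_ratio, seed=None):
    """Return a [num_temporal][grid_h * grid_w] boolean mask.

    True marks a masked (hidden) token. The same spatial pattern is
    reused for every temporal slice, which is the assumed meaning of
    "tube" masking in this sketch.
    """
    rng = random.Random(seed)
    num_spatial = grid_h * grid_w
    # Number of spatial positions to hide in each temporal slice.
    num_masked = int(mask_ratio * num_spatial)
    masked_positions = set(rng.sample(range(num_spatial), num_masked))
    frame_mask = [i in masked_positions for i in range(num_spatial)]
    # Copy the single spatial mask along the temporal axis.
    return [list(frame_mask) for _ in range(num_temporal)]


# Hypothetical token grid: 8 temporal slices of 14 x 14 spatial patches.
mask = tube_mask(num_temporal=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
masked_per_slice = sum(mask[0])
visible_per_slice = len(mask[0]) - masked_per_slice
```

Under these assumptions only the visible tokens would be passed to the encoder, which is where a high masking ratio reduces pre-training cost. Sharing the mask across time also means a hidden patch cannot be recovered by copying the same location from a neighboring frame.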
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-400 | Acc@1 | 87.4 | VideoMAE (no extra data, ViT-H, 32x320x320) |
| Video | Kinetics-400 | Acc@5 | 97.6 | VideoMAE (no extra data, ViT-H, 32x320x320) |
| Video | Kinetics-400 | Acc@1 | 86.6 | VideoMAE (no extra data, ViT-H) |
| Video | Kinetics-400 | Acc@5 | 97.1 | VideoMAE (no extra data, ViT-H) |
| Video | Kinetics-400 | Acc@1 | 86.1 | VideoMAE (no extra data, ViT-L, 32x320x320) |
| Video | Kinetics-400 | Acc@5 | 97.3 | VideoMAE (no extra data, ViT-L, 32x320x320) |
| Video | Kinetics-400 | Acc@1 | 85.2 | VideoMAE (no extra data, ViT-L, 16x4) |
| Video | Kinetics-400 | Acc@5 | 96.8 | VideoMAE (no extra data, ViT-L, 16x4) |
| Video | Kinetics-400 | Acc@1 | 81.5 | VideoMAE (no extra data, ViT-B, 16x4) |
| Video | Kinetics-400 | Acc@5 | 95.1 | VideoMAE (no extra data, ViT-B, 16x4) |
| Activity Recognition | Something-Something V2 | Parameters (M) | 305 | VideoMAE (no extra data, ViT-L, 32x2) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 75.4 | VideoMAE (no extra data, ViT-L, 32x2) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 95.2 | VideoMAE (no extra data, ViT-L, 32x2) |
| Activity Recognition | Something-Something V2 | Parameters (M) | 305 | VideoMAE (no extra data, ViT-L, 16frame) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 74.3 | VideoMAE (no extra data, ViT-L, 16frame) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 94.6 | VideoMAE (no extra data, ViT-L, 16frame) |
| Activity Recognition | Something-Something V2 | Parameters (M) | 87 | VideoMAE (no extra data, ViT-B, 16frame) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 70.8 | VideoMAE (no extra data, ViT-B, 16frame) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 92.4 | VideoMAE (no extra data, ViT-B, 16frame) |
| Activity Recognition | AVA v2.2 | mAP | 39.5 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) |
| Activity Recognition | AVA v2.2 | mAP | 39.3 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) |
| Activity Recognition | AVA v2.2 | mAP | 37.8 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) |
| Activity Recognition | AVA v2.2 | mAP | 36.5 | VideoMAE (K400 pretrain, ViT-H, 16x4) |
| Activity Recognition | AVA v2.2 | mAP | 36.1 | VideoMAE (K700 pretrain, ViT-L, 16x4) |
| Activity Recognition | AVA v2.2 | mAP | 34.3 | VideoMAE (K400 pretrain, ViT-L, 16x4) |
| Activity Recognition | AVA v2.2 | mAP | 31.8 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) |
| Activity Recognition | AVA v2.2 | mAP | 26.7 | VideoMAE (K400 pretrain, ViT-B, 16x4) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 96.1 | VideoMAE |
| Activity Recognition | UCF101 | 3-fold Accuracy | 91.3 | VideoMAE (no extra data) |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 73.3 | VideoMAE |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 62.6 | VideoMAE (no extra data) |
| Action Recognition | Something-Something V2 | Parameters (M) | 305 | VideoMAE (no extra data, ViT-L, 32x2) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 75.4 | VideoMAE (no extra data, ViT-L, 32x2) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 95.2 | VideoMAE (no extra data, ViT-L, 32x2) |
| Action Recognition | Something-Something V2 | Parameters (M) | 305 | VideoMAE (no extra data, ViT-L, 16frame) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 74.3 | VideoMAE (no extra data, ViT-L, 16frame) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 94.6 | VideoMAE (no extra data, ViT-L, 16frame) |
| Action Recognition | Something-Something V2 | Parameters (M) | 87 | VideoMAE (no extra data, ViT-B, 16frame) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 70.8 | VideoMAE (no extra data, ViT-B, 16frame) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 92.4 | VideoMAE (no extra data, ViT-B, 16frame) |
| Action Recognition | AVA v2.2 | mAP | 39.5 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) |
| Action Recognition | AVA v2.2 | mAP | 39.3 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) |
| Action Recognition | AVA v2.2 | mAP | 37.8 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) |
| Action Recognition | AVA v2.2 | mAP | 36.5 | VideoMAE (K400 pretrain, ViT-H, 16x4) |
| Action Recognition | AVA v2.2 | mAP | 36.1 | VideoMAE (K700 pretrain, ViT-L, 16x4) |
| Action Recognition | AVA v2.2 | mAP | 34.3 | VideoMAE (K400 pretrain, ViT-L, 16x4) |
| Action Recognition | AVA v2.2 | mAP | 31.8 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) |
| Action Recognition | AVA v2.2 | mAP | 26.7 | VideoMAE (K400 pretrain, ViT-B, 16x4) |
| Action Recognition | UCF101 | 3-fold Accuracy | 96.1 | VideoMAE |
| Action Recognition | UCF101 | 3-fold Accuracy | 91.3 | VideoMAE (no extra data) |
| Action Recognition | HMDB51 | Top-1 Accuracy | 73.3 | VideoMAE |
| Action Recognition | HMDB51 | Top-1 Accuracy | 62.6 | VideoMAE (no extra data) |