Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer
We present Multiscale Vision Transformers (MViT) for video and image recognition, connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features, with early layers operating at high spatial resolution to model simple low-level visual information and deeper layers operating on spatially coarse but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model to image classification, where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast
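To make the channel-resolution trade-off concrete, the following is a minimal sketch of a stage schedule, not the released implementation. It assumes that each scale stage doubles the channel dimension and halves the spatial resolution along each axis. The helper name `stage_schedule`, the `Stage` type, and the starting values (96 channels on a 56x56 token grid, four stages) are illustrative assumptions rather than values stated in the abstract.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    """Shape of the feature map at one scale stage (hypothetical type)."""
    channels: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Number of spatial positions the attention layers operate on.
        return self.height * self.width


def stage_schedule(base_channels: int, height: int, width: int,
                   num_stages: int) -> List[Stage]:
    """Return per-stage shapes, assuming 2x channels and 1/2 resolution per stage."""
    stages = []
    c, h, w = base_channels, height, width
    for _ in range(num_stages):
        stages.append(Stage(c, h, w))
        # Assumed transition rule: expand channels, shrink the spatial grid.
        c, h, w = c * 2, h // 2, w // 2
    return stages


if __name__ == "__main__":
    for i, s in enumerate(stage_schedule(96, 56, 56, 4), start=1):
        print(f"stage {i}: channels={s.channels:4d} "
              f"resolution={s.height}x{s.width} tokens={s.tokens}")
```

Under these assumptions the token count drops by 4x per stage while the channel width doubles, so the early, high-resolution stages stay narrow and the wide stages run on few tokens.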
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Charades | mAP | 47.7 | MViT-B-24, 32x3 (Kinetics-600 pretraining) |
| Video | Charades | mAP | 47.1 | MViT-B, 32x3 (Kinetics-600 pretraining) |
| Video | Charades | mAP | 46.3 | MViT-B-24, 32x3 (Kinetics-400 pretraining) |
| Video | Charades | mAP | 44.3 | MViT-B, 32x3 (Kinetics-400 pretraining) |
| Video | Charades | mAP | 43.9 | MViT-B, 16x4 (Kinetics-600 pretraining) |
| Video | Charades | mAP | 40.0 | MViT-B, 16x4 (Kinetics-400 pretraining) |
| Video | Kinetics-400 | Top-1 Accuracy | 81.2 | MViT-B, 64x3 |
| Video | Kinetics-400 | Top-5 Accuracy | 95.1 | MViT-B, 64x3 |
| Video | Kinetics-400 | Top-1 Accuracy | 80.2 | MViT-B, 32x3 |
| Video | Kinetics-400 | Top-5 Accuracy | 94.4 | MViT-B, 32x3 |
| Video | Kinetics-400 | Top-1 Accuracy | 78.4 | MViT-B, 16x4 |
| Video | Kinetics-400 | Top-5 Accuracy | 93.5 | MViT-B, 16x4 |
| Video | Kinetics-400 | Top-1 Accuracy | 76.0 | MViT-S |
| Video | Kinetics-400 | Top-5 Accuracy | 92.1 | MViT-S |
| Video | Kinetics-600 | Top-1 Accuracy | 83.8 | MViT-B-24, 32x3 |
| Video | Kinetics-600 | Top-5 Accuracy | 96.3 | MViT-B-24, 32x3 |
| Video | Kinetics-600 | Top-1 Accuracy | 83.4 | MViT-B, 32x3 |
| Video | Kinetics-600 | Top-5 Accuracy | 96.3 | MViT-B, 32x3 |
| Video | Kinetics-600 | Top-1 Accuracy | 82.1 | MViT-B, 16x4 |
| Video | Kinetics-600 | Top-5 Accuracy | 95.7 | MViT-B, 16x4 |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 68.7 | MViT-B-24, 32x3 |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.5 | MViT-B-24, 32x3 |
| Activity Recognition | Something-Something V2 | Parameters (M) | 36.6 | MViT-B, 32x3 (Kinetics-600 pretraining) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 67.8 | MViT-B, 32x3 (Kinetics-600 pretraining) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.3 | MViT-B, 32x3 (Kinetics-600 pretraining) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 66.2 | MViT-B, 16x4 |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 90.2 | MViT-B, 16x4 |
| Activity Recognition | AVA v2.2 | mAP | 28.7 | MViT-B-24, 32x3 (Kinetics-600 pretraining) |
| Activity Recognition | AVA v2.2 | mAP | 27.5 | MViT-B, 32x3 (Kinetics-600 pretraining) |
| Activity Recognition | AVA v2.2 | mAP | 27.3 | MViT-B, 64x3 (Kinetics-400 pretraining) |
| Activity Recognition | AVA v2.2 | mAP | 26.8 | MViT-B, 32x3 (Kinetics-400 pretraining) |
| Activity Recognition | AVA v2.2 | mAP | 26.1 | MViT-B, 16x4 (Kinetics-600 pretraining) |
| Activity Recognition | AVA v2.2 | mAP | 24.5 | MViT-B, 16x4 (Kinetics-400 pretraining) |
| Image Classification | ImageNet | GFLOPs | 32.7 | MViT-B-24 |
| Image Classification | ImageNet | GFLOPs | 7.8 | MViT-B-16 |