Yanghao Li, Chao-yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it on ImageNet classification, COCO detection and Kinetics video recognition, where it outperforms prior work. We further compare MViTv2's pooling attention to window attention mechanisms, which it outperforms in accuracy/compute trade-off. Without bells and whistles, MViTv2 achieves state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 box AP on COCO object detection, and 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit.
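The two additions named in the abstract can be illustrated with a minimal single-head NumPy sketch. This is not the paper's implementation: plain strided subsampling stands in for MViT's learned convolutional pooling, and the per-axis scalar bias tables `Rh`/`Rw` (random here, learned in practice) are a simplification of the paper's query-dependent decomposed relative positional embeddings. All function and variable names are illustrative.

```python
import numpy as np

def pool(x, stride):
    """Strided subsampling of tokens on a square grid (a simple stand-in
    for the learned conv pooling used in MViT)."""
    H = int(np.sqrt(x.shape[0]))
    g = x.reshape(H, H, -1)[::stride, ::stride]
    return g.reshape(-1, x.shape[1])

def decomposed_rel_bias(Hq, Hk, Rh, Rw):
    """Decomposed relative position: one 1-D table per axis,
    bias(i, j) = Rh[h_i - h_j] + Rw[w_i - w_j], instead of a joint 2-D table."""
    qc = (np.arange(Hq) * Hk) // Hq           # align coords across resolutions
    kc = np.arange(Hk)
    idx = qc[:, None] - kc[None, :] + Hk - 1  # shift offsets into [0, 2*Hk - 2]
    bh, bw = Rh[idx], Rw[idx]                 # (Hq, Hk) each; square grid, reuse idx
    b = bh[:, None, :, None] + bw[None, :, None, :]
    return b.reshape(Hq * Hq, Hk * Hk)

def pooling_attention(x, Wq, Wk, Wv, stride_q=1, stride_kv=2, rng=None):
    """Single-head pooling attention with MViTv2's two changes:
    decomposed relative positions and the residual pooling connection."""
    H = int(np.sqrt(x.shape[0]))
    q = pool(x @ Wq, stride_q)                # project, then pool Q, K, V
    k = pool(x @ Wk, stride_kv)
    v = pool(x @ Wv, stride_kv)
    Hq, Hk = H // stride_q, H // stride_kv
    rng = rng or np.random.default_rng(0)
    Rh = 0.02 * rng.standard_normal(2 * Hk - 1)  # random stand-ins for learned tables
    Rw = 0.02 * rng.standard_normal(2 * Hk - 1)
    scores = q @ k.T / np.sqrt(q.shape[1]) + decomposed_rel_bias(Hq, Hk, Rh, Rw)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ v + q  # residual pooling connection: add the pooled query back
```

The decomposition is what keeps the positional tables cheap as resolution grows: two tables of length O(H) replace one joint table of size O(H²), while pooling K and V (`stride_kv=2`) shrinks the attention matrix by the square of the stride.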
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-700 | Top-1 Accuracy | 79.4 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-700 | Top-5 Accuracy | 94.9 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-700 | Top-1 Accuracy | 79.4 | MoViNet-A6 |
| Video | Kinetics-700 | Top-1 Accuracy | 76.6 | MViTv2-B |
| Video | Kinetics-700 | Top-5 Accuracy | 93.2 | MViTv2-B |
| Video | Kinetics-400 | Top-1 Accuracy | 86.1 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-400 | Top-5 Accuracy | 97.0 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-600 | Top-1 Accuracy | 87.9 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-600 | Top-5 Accuracy | 97.9 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-600 | Top-1 Accuracy | 85.5 | MViTv2-L (train from scratch) |
| Video | Kinetics-600 | Top-5 Accuracy | 97.2 | MViTv2-L (train from scratch) |
| Action Recognition | Something-Something V2 | Parameters (M) | 213.1 | MViTv2-L (IN-21K + Kinetics-400 pretrain) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 73.3 | MViTv2-L (IN-21K + Kinetics-400 pretrain) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 94.1 | MViTv2-L (IN-21K + Kinetics-400 pretrain) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 72.1 | MViTv2-B (IN-21K + Kinetics-400 pretrain) |
| Action Recognition | Something-Something V2 | Parameters (M) | 51.1 | MViTv2-B (IN-21K + Kinetics-400 pretrain) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 93.4 | MViTv2-B (IN-21K + Kinetics-400 pretrain) |
| Action Recognition | AVA v2.2 | mAP | 34.4 | MViTv2-L (IN-21K + Kinetics-700 pretrain) |
| Object Detection | COCO-O | Average mAP | 30.9 | MViTv2-H (Cascade Mask R-CNN) |
| Object Detection | COCO-O | Effective Robustness | 5.62 | MViTv2-H (Cascade Mask R-CNN) |
| Object Detection | COCO minival | box AP | 58.7 | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) |
| Object Detection | COCO minival | box AP | 56.1 | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) |
| Object Detection | COCO minival | box AP | 54.3 | MViTv2-L (Cascade Mask R-CNN, single-scale) |
| Object Detection | COCO minival | box AP | 52.7 | MViT-L (Mask R-CNN, single-scale, IN21k pre-train) |
| Image Classification | ImageNet | GFLOPs | 763.5 | MViTv2-H (512 res, ImageNet-21k pretrain) |
| Image Classification | ImageNet | GFLOPs | 140.7 | MViTv2-L (384 res, ImageNet-21k pretrain) |
| Image Classification | ImageNet | GFLOPs | 120.6 | MViTv2-H (ImageNet-21k pretrain) |
| Image Classification | ImageNet | GFLOPs | 140.2 | MViTv2-L (384 res) |
| Image Classification | ImageNet | GFLOPs | 4.7 | MViTv2-T |
| Instance Segmentation | COCO minival | mask AP | 50.5 | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) |
| Instance Segmentation | COCO minival | mask AP | 48.5 | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) |
| Instance Segmentation | COCO minival | mask AP | 47.1 | MViTv2-L (Cascade Mask R-CNN, single-scale) |
| Instance Segmentation | COCO minival | mask AP | 46.2 | MViT-L (Mask R-CNN, single-scale) |