Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li, Chao-yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

Date: 2021-12-02
Venue: CVPR 2022
Tasks: Image Classification, Action Classification, Video Recognition, Instance Segmentation, Video Classification, Action Recognition, Object Detection

Abstract

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit.
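
The two ingredients named in the abstract, decomposed relative positional embeddings and residual pooling connections, can be illustrated with a small pure-Python sketch. This is not the authors' implementation (see the official repository linked above for that). It assumes a 2D grid, a single attention head, and queries, keys, and values that have already been pooled to the same resolution; the pooling operators themselves are omitted. All function names (`rel_table`, `decomposed_rel`, `attention_with_residual`, `embedding_counts`) are hypothetical and chosen for illustration.

```python
import math
import random


def rel_table(size, dim, rng):
    """One embedding vector per relative offset in [-(size-1), size-1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(2 * size - 1)]


def decomposed_rel(pos_q, pos_k, table_h, table_w, height, width):
    """Relative embedding as a sum of per-axis embeddings: R = R_h + R_w."""
    (hq, wq), (hk, wk) = pos_q, pos_k
    r_h = table_h[hq - hk + height - 1]  # shift offset to a non-negative index
    r_w = table_w[wq - wk + width - 1]
    return [a + b for a, b in zip(r_h, r_w)]


def embedding_counts(height, width):
    """Number of learned vectors: joint 2D table vs. decomposed per-axis tables."""
    joint = (2 * height - 1) * (2 * width - 1)
    decomposed = (2 * height - 1) + (2 * width - 1)
    return joint, decomposed


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_with_residual(queries, keys, values, q_pos, k_pos,
                            table_h, table_w, height, width):
    """Single-head attention with a relative-position term and a residual on Q.

    Assumption: inputs are already pooled; real pooling attention would pool
    Q, K, and V (possibly to different resolutions) before this step.
    """
    dim = len(queries[0])
    outputs = []
    for q, pq in zip(queries, q_pos):
        logits = []
        for k, pk in zip(keys, k_pos):
            r = decomposed_rel(pq, pk, table_h, table_w, height, width)
            # content term q.k plus position term q.r, scaled by sqrt(dim)
            logits.append((dot(q, k) + dot(q, r)) / math.sqrt(dim))
        weights = softmax(logits)
        attended = [sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(dim)]
        # residual connection: add the (pooled) query to the output
        outputs.append([a + qc for a, qc in zip(attended, q)])
    return outputs


if __name__ == "__main__":
    # For a 14x14 grid the joint table needs 729 vectors, the decomposed one 54.
    print(embedding_counts(14, 14))
```

The point of the decomposition is visible in `embedding_counts`: the number of learned vectors grows with the sum of the axis lengths rather than their product, and the embedding depends only on the relative offset between two positions, not on their absolute locations.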

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-700 | Top-1 Accuracy | 79.4 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-700 | Top-5 Accuracy | 94.9 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-700 | Top-1 Accuracy | 79.4 | MoViNet-A6 |
| Video | Kinetics-700 | Top-1 Accuracy | 76.6 | MViTv2-B |
| Video | Kinetics-700 | Top-5 Accuracy | 93.2 | MViTv2-B |
| Video | Kinetics-400 | Acc@1 | 86.1 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-400 | Acc@5 | 97 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-600 | Top-1 Accuracy | 87.9 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-600 | Top-5 Accuracy | 97.9 | MViTv2-L (ImageNet-21k pretrain) |
| Video | Kinetics-600 | Top-1 Accuracy | 85.5 | MViTv2-L (train from scratch) |
| Video | Kinetics-600 | Top-5 Accuracy | 97.2 | MViTv2-B (train from scratch) |
| Activity Recognition | Something-Something V2 | Parameters (M) | 213.1 | MViTv2-L (IN-21K + Kinetics400 pretrain) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 73.3 | MViTv2-L (IN-21K + Kinetics400 pretrain) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 94.1 | MViTv2-L (IN-21K + Kinetics400 pretrain) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 72.1 | MViT-B (IN-21K + Kinetics400 pretrain) |
| Activity Recognition | Something-Something V2 | Parameters (M) | 51.1 | MViTv2-B (IN-21K + Kinetics400 pretrain) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 93.4 | MViTv2-B (IN-21K + Kinetics400 pretrain) |
| Activity Recognition | AVA v2.2 | mAP | 34.4 | MViTv2-L (IN21k, K700) |
| Object Detection | COCO-O | Average mAP | 30.9 | MViTv2-H (Cascade Mask R-CNN) |
| Object Detection | COCO-O | Effective Robustness | 5.62 | MViTv2-H (Cascade Mask R-CNN) |
| Object Detection | COCO minival | box AP | 58.7 | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) |
| Object Detection | COCO minival | box AP | 56.1 | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) |
| Object Detection | COCO minival | box AP | 54.3 | MViTv2-L (Cascade Mask R-CNN, single-scale) |
| Object Detection | COCO minival | box AP | 52.7 | MViT-L (Mask R-CNN, single-scale, IN21k pre-train) |
| Image Classification | ImageNet | GFLOPs | 763.5 | MViTv2-H (512 res, ImageNet-21k pretrain) |
| Image Classification | ImageNet | GFLOPs | 140.7 | MViTv2-L (384 res, ImageNet-21k pretrain) |
| Image Classification | ImageNet | GFLOPs | 120.6 | MViTv2-H (ImageNet-21k pretrain) |
| Image Classification | ImageNet | GFLOPs | 140.2 | MViTv2-L (384 res) |
| Image Classification | ImageNet | GFLOPs | 4.7 | MViTv2-T |
| Instance Segmentation | COCO minival | mask AP | 50.5 | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) |
| Instance Segmentation | COCO minival | mask AP | 48.5 | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) |
| Instance Segmentation | COCO minival | mask AP | 47.1 | MViTv2-L (Cascade Mask R-CNN, single-scale) |
| Instance Segmentation | COCO minival | mask AP | 46.2 | MViT-L (Mask R-CNN, single-scale) |

The six Object Detection rows are also listed, with identical values, under the task labels 3D, 2D Classification, 2D Object Detection, and 16k. The seven Activity Recognition rows are also listed, with identical values, under Action Recognition.
