Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multiscale Vision Transformers

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer

Published: 2021-04-22, ICCV 2021
Tasks: Image Classification, Action Classification, Video Recognition, Action Recognition

Abstract

We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast
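The abstract's core idea (stages that hierarchically expand channel capacity while reducing spatial resolution) can be pictured as a simple shape schedule. The sketch below is illustrative only, not the official implementation: the base channel width (96), stride-4 patchify, and four scale stages follow the MViT-B configuration described in the paper, but the function itself is just a shape calculator.

```python
def mvit_stage_schedule(input_res=224, patch_stride=4, base_channels=96, num_stages=4):
    """Return the (spatial_resolution, channels) pair at each scale stage.

    Assumes the MViT-B-style recipe: patchify downsamples the input by
    `patch_stride`, then each stage transition halves spatial resolution
    (via pooling attention) while doubling the channel dimension.
    """
    res = input_res // patch_stride        # patchify: 224 -> 56
    ch = base_channels
    stages = []
    for _ in range(num_stages):
        stages.append((res, ch))
        res //= 2                          # spatial resolution shrinks
        ch *= 2                            # channel capacity expands
    return stages

print(mvit_stage_schedule())
# -> [(56, 96), (28, 192), (14, 384), (7, 768)]
```

Early stages thus operate on many high-resolution tokens with few channels, while late stages operate on few coarse tokens with high-dimensional features, which is the multiscale pyramid the abstract describes.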

Results

Task | Dataset | Metric | Value | Model
Video | Charades | mAP | 47.7 | MViT-B-24, 32x3 (Kinetics-600 pretraining)
Video | Charades | mAP | 47.1 | MViT-B, 32x3 (Kinetics-600 pretraining)
Video | Charades | mAP | 46.3 | MViT-B-24, 32x3 (Kinetics-400 pretraining)
Video | Charades | mAP | 44.3 | MViT-B, 32x3 (Kinetics-400 pretraining)
Video | Charades | mAP | 43.9 | MViT-B, 16x4 (Kinetics-600 pretraining)
Video | Charades | mAP | 40.0 | MViT-B, 16x4 (Kinetics-400 pretraining)
Video | Kinetics-400 | Acc@1 | 81.2 | MViT-B, 64x3
Video | Kinetics-400 | Acc@5 | 95.1 | MViT-B, 64x3
Video | Kinetics-400 | Acc@1 | 80.2 | MViT-B, 32x3
Video | Kinetics-400 | Acc@5 | 94.4 | MViT-B, 32x3
Video | Kinetics-400 | Acc@1 | 78.4 | MViT-B, 16x4
Video | Kinetics-400 | Acc@5 | 93.5 | MViT-B, 16x4
Video | Kinetics-400 | Acc@1 | 76.0 | MViT-S
Video | Kinetics-400 | Acc@5 | 92.1 | MViT-S
Video | Kinetics-600 | Top-1 Accuracy | 83.8 | MViT-B-24, 32x3
Video | Kinetics-600 | Top-5 Accuracy | 96.3 | MViT-B-24, 32x3
Video | Kinetics-600 | Top-1 Accuracy | 83.4 | MViT-B, 32x3
Video | Kinetics-600 | Top-5 Accuracy | 96.3 | MViT-B, 32x3
Video | Kinetics-600 | Top-1 Accuracy | 82.1 | MViT-B, 16x4
Video | Kinetics-600 | Top-5 Accuracy | 95.7 | MViT-B, 16x4
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 68.7 | MViT-B-24, 32x3
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.5 | MViT-B-24, 32x3
Activity Recognition | Something-Something V2 | Parameters (M) | 36.6 | MViT-B, 32x3 (Kinetics-600 pretraining)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 67.8 | MViT-B, 32x3 (Kinetics-600 pretraining)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.3 | MViT-B, 32x3 (Kinetics-600 pretraining)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 66.2 | MViT-B, 16x4
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 90.2 | MViT-B, 16x4
Activity Recognition | AVA v2.2 | mAP | 28.7 | MViT-B-24, 32x3 (Kinetics-600 pretraining)
Activity Recognition | AVA v2.2 | mAP | 27.5 | MViT-B, 32x3 (Kinetics-600 pretraining)
Activity Recognition | AVA v2.2 | mAP | 27.3 | MViT-B, 64x3 (Kinetics-400 pretraining)
Activity Recognition | AVA v2.2 | mAP | 26.8 | MViT-B, 32x3 (Kinetics-400 pretraining)
Activity Recognition | AVA v2.2 | mAP | 26.1 | MViT-B, 16x4 (Kinetics-600 pretraining)
Activity Recognition | AVA v2.2 | mAP | 24.5 | MViT-B, 16x4 (Kinetics-400 pretraining)
Image Classification | ImageNet | GFLOPs | 32.7 | MViT-B-24
Image Classification | ImageNet | GFLOPs | 7.8 | MViT-B-16
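The table above reports Acc@1/Acc@5 (equivalently Top-1/Top-5 accuracy): a prediction counts as correct at top-k when the true label is among the k highest-scoring classes. A minimal pure-Python sketch of the metric, using made-up scores for illustration (real evaluations operate on framework tensors):

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the top-k scores.

    scores: list of per-class score lists, one row per sample
    labels: list of true class indices, one per sample
    """
    correct = 0
    for row, label in zip(scores, labels):
        # indices of the k highest-scoring classes for this sample
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        correct += label in topk
    return 100.0 * correct / len(labels)

# Toy example: 3 samples, 3 classes
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.3, 0.1, 0.6]]
labels = [1, 2, 0]
print(round(topk_accuracy(scores, labels, k=1), 2))  # 33.33
print(round(topk_accuracy(scores, labels, k=2), 2))  # 66.67
```

Acc@5 is always at least Acc@1, which matches the gap between the paired rows in the table (e.g. 81.2 vs 95.1 for MViT-B, 64x3 on Kinetics-400).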

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)