| 1 | TubeViT-H (ImageNet-1k) | 98.9 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 2 | Unmasked Teacher (ViT-L) | 98.7 | No | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 3 | UMT-L (ViT-L/16) | 98.7 | No | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 4 | TubeViT-L (ImageNet-1k) | 98.6 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 5 | UniFormerV2-L (ViT-L, 336) | 98.4 | Yes | - | - | Code |
| 6 | VideoMAE V2-g (64x266x266) | 98.4 | No | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 7 | BIKE (CLIP ViT-L/14) | 98.4 | No | Bidirectional Cross-Modal Knowledge Exploration ... | 2022-12-31 | Code |
| 8 | MTV-H (WTS 60M) | 98.3 | No | Multiview Transformers for Video Recognition | 2022-01-12 | Code |
| 9 | ATM | 98.3 | No | What Can Simple Arithmetic Operations Do for Tem... | 2023-07-18 | Code |
| 10 | DejaVid | 98.2 | Yes | - | - | Code |
| 11 | Side4Video (EVA, ViT-E/14) | 98.2 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 12 | VideoMAE V2-g | 98.1 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 13 | ILA (ViT-L/14) | 97.8 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 14 | ONE-PEACE | 97.8 | No | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 15 | EVL (CLIP ViT-L/14@336px, frozen, 32 frames) | 97.8 | No | Frozen CLIP Models are Efficient Video Learners | 2022-08-06 | Code |
| 16 | DualPath w/ ViT-L/14 | 97.8 | No | Dual-path Adaptation from Image to Video Transfo... | 2023-03-17 | Code |
| 17 | AIM (CLIP ViT-L/14, 32x224) | 97.7 | Yes | AIM: Adapting Image Models for Efficient Video A... | 2023-02-06 | Code |
| 18 | mPLUG-2 | 97.7 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 19 | TubeViT-B (ImageNet-1k) | 97.6 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 20 | Text4Vis (CLIP ViT-L/14) | 97.6 | No | Revisiting Classifier: Transferring Vision-Langu... | 2022-07-04 | Code |
| 21 | VideoMAE (no extra data, ViT-H, 32x320x320) | 97.6 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 22 | ST-Adapter (ViT-L, CLIP) | 97.6 | No | ST-Adapter: Parameter-Efficient Image-to-Video T... | 2022-06-27 | Code |
| 23 | ZeroI2V ViT-L/14 | 97.6 | No | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra... | 2023-10-02 | Code |
| 24 | CoVeR (JFT-3B) | 97.5 | No | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 25 | X-CLIP (ViT-L/14, CLIP) | 97.4 | No | Expanding Language-Image Pretrained Models for G... | 2022-08-04 | Code |
| 26 | MVD (K400 pretrain, ViT-H, 16x224x224) | 97.4 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 27 | MaskFeat (K600, MViT-L) | 97.4 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 28 | MaskFeat (no extra data, MViT-L) | 97.3 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 29 | VideoMAE (no extra data, ViT-L, 32x320x320) | 97.3 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 30 | CoVeR (JFT-300M) | 97.2 | No | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 31 | ILA (ViT-B/16) | 97.2 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 32 | VideoMAE (no extra data, ViT-H) | 97.1 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 33 | DualPath w/ ViT-B/16 | 97.1 | No | Dual-path Adaptation from Image to Video Transfo... | 2023-03-17 | Code |
| 34 | ActionCLIP (CLIP-pretrained) | 97.1 | No | ActionCLIP: A New Paradigm for Video Action Reco... | 2021-09-17 | Code |
| 35 | MVD (K400 pretrain, ViT-L, 16x224x224) | 97 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 36 | MViTv2-L (ImageNet-21k pretrain) | 97 | Yes | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 37 | VideoMAE (no extra data, ViT-L, 16x4) | 96.8 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 38 | Swin-L (384x384, ImageNet-21k pretrain) | 96.7 | No | Video Swin Transformer | 2021-06-24 | Code |
| 39 | MAR (50% mask, ViT-L, 16x4) | 96.3 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 40 | OMNIVORE (Swin-B) | 96.2 | No | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code |
| 41 | OMNIVORE (Swin-L) | 96.1 | No | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code |
| 42 | MAR (75% mask, ViT-L, 16x4) | 96 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 43 | Swin-L (ImageNet-21k pretrain) | 95.9 | No | Video Swin Transformer | 2021-06-24 | Code |
| 44 | ViViT-H/16x2 (JFT) | 95.8 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 45 | MVD (K400 pretrain, ViT-B, 16x224x224) | 95.8 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 46 | ILA (ViT-B/32) | 95.8 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 47 | Swin-B (ImageNet-21k pretrain) | 95.5 | No | Video Swin Transformer | 2021-06-24 | Code |
| 48 | VATT-Large | 95.5 | No | VATT: Transformers for Multimodal Self-Supervise... | 2021-04-22 | Code |
| 49 | ip-CSN-152 (IG-65M pretraining) | 95.3 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 50 | AMD (ViT-B/16) | 95.3 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 51 | AdaMAE | 95.2 | No | AdaMAE: Adaptive Masking for Efficient Spatiotem... | 2022-11-16 | Code |
| 52 | LGD-3D Two-stream (ResNet-101) | 95.2 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 53 | Motionformer-HR | 95.2 | No | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code |
| 54 | VideoMAE (no extra data, ViT-B, 16x4) | 95.1 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 55 | R[2+1]D-152 (IG-65M pretraining) | 95.1 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 56 | MViT-B, 64x3 | 95.1 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 57 | MoViNet-A5 | 94.9 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 58 | DirecFormer | 94.86 | No | DirecFormer: A Directed Attention in Transformer... | 2022-03-19 | Code |
| 59 | MVD (K400 pretrain, ViT-S, 16x224x224) | 94.8 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 60 | TimeSformer-L | 94.7 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 61 | ViViT-L/16x2 320 | 94.7 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 62 | MBT (AV) | 94.6 | No | Attention Bottlenecks for Multimodal Fusion | 2021-06-30 | Code |
| 63 | Swin-B (ImageNet-1k pretrain) | 94.6 | No | Video Swin Transformer | 2021-06-24 | Code |
| 64 | En-VidTr-L | 94.6 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 65 | X3D-XXL | 94.6 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 66 | UniFormer-B (ImageNet-1K) | 94.5 | No | - | - | Code |
| 67 | Swin-S (ImageNet-1k pretrain) | 94.5 | No | Video Swin Transformer | 2021-06-24 | Code |
| 68 | MoViNet-A4 | 94.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 69 | AMD (ViT-S/16) | 94.5 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 70 | OmniVL | 94.5 | No | OmniVL: One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 71 | MAR (50% mask, ViT-B, 16x4) | 94.4 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 72 | OmniSource SlowOnly R101 8x8 (ImageNet pretrain) | 94.4 | No | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 73 | R3D-RS-200 | 94.4 | No | Revisiting 3D ResNets for Video Recognition | 2021-09-03 | Code |
| 74 | OmniSource SlowOnly R101 8x8 (Scratch) | 94.4 | No | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 75 | MViT-B, 32x3 | 94.4 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 76 | TimeSformer-HR | 94.4 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 77 | LGD-3D RGB (ResNet-101) | 94.4 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 78 | TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 94.4 | No | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code |
| 79 | En-VidTr-M | 94.2 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 80 | ViT-B-VTN+ ImageNet-21K (84.0 [10]) | 94.2 | No | Video Transformer Network | 2021-02-01 | Code |
| 81 | En-VidTr-S | 94 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 82 | X3D-XL | 93.9 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 83 | SlowFast 16x8 (ResNet-101 + NL) | 93.9 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 84 | ip-CSN-152 (Sports-1M pretraining) | 93.8 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 85 | MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 93.8 | No | MVFNet: Multi-View Fusion Network for Efficient ... | 2020-12-13 | Code |
| 86 | MoViNet-A3 | 93.8 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 87 | MAR (75% mask, ViT-B, 16x4) | 93.7 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 88 | TAdaConvNeXt-T | 93.7 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 89 | ViT-B-VTN (3 layers, ImageNet pretrain) | 93.7 | No | Video Transformer Network | 2021-02-01 | Code |
| 90 | TimeSformer | 93.7 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 91 | Swin-T (ImageNet-1k pretrain) | 93.6 | No | Video Swin Transformer | 2021-06-24 | Code |
| 92 | SlowFast 16x8 (ResNet-101) | 93.5 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 93 | MViT-B, 16x4 | 93.5 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 94 | TAda2D-En (ResNet-50, 8+16 frames) | 93.5 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 95 | S3D-G (RGB, ImageNet pretrained) | 93.4 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 96 | ViT-B-VTN (1 layer, ImageNet pretrain) | 93.4 | No | Video Transformer Network | 2021-02-01 | Code |
| 97 | I3D + NL | 93.3 | No | Non-local Neural Networks | 2017-11-21 | Code |
| 98 | SlowFast 8x8 (ResNet-101) | 93.2 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 99 | BQN (ResNet-50) | 93.2 | No | Busy-Quiet Video Disentangling for Video Classif... | 2021-03-29 | Code |
| 100 | TAda2D (ResNet-50, 16 frames) | 93.1 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 101 | S3D-G (RGB+Flow, ImageNet pretrained) | 93 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 102 | X3D-L | 92.9 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 103 | ip-CSN-152 | 92.8 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 104 | SlowFast 8x8 (ResNet-50) | 92.6 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 105 | TAda2D (ResNet-50, 8 frames) | 92.6 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 106 | X3D-M | 92.3 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 107 | MoViNet-A2 | 92.3 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 108 | MViT-S | 92.1 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 109 | SlowFast 4x16 (ResNet-50) | 92.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 110 | R[2+1]D-Flow (Sports-1M pretrain) | 91.9 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 111 | A2 Net | 91.5 | No | $A^2$-Nets: Double Attention Networks | 2018-10-27 | - |
| 112 | R[2+1]D-RGB (Sports-1M pretrain) | 91.4 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 113 | bLVNet Fan et al. (2019) | 91.2 | No | More Is Less: Learning Efficient Video Represent... | 2019-12-02 | Code |
| 114 | MoViNet-A1 | 91.2 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 115 | TSN | 91.1 | No | Temporal Segment Networks: Towards Good Practice... | 2016-08-02 | Code |
| 116 | R[2+1]D-Two-Stream | 90.9 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 117 | Inception-ResNet | 90.9 | No | Revisiting the Effectiveness of Off-the-shelf Te... | 2017-08-12 | - |
| 118 | LGD-3D Flow (ResNet-101) | 90.9 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 119 | MFNet | 90.4 | No | Multi-Fiber Networks for Video Recognition | 2018-07-30 | - |
| 120 | ARTNet | 90.4 | No | Appearance-and-Relation Networks for Video Class... | 2017-11-24 | Code |
| 121 | R[2+1]D | 90 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 122 | R[2+1]D-RGB | 90 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 123 | I3D | 89.3 | No | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 124 | S3D-G (Flow, ImageNet pretrained) | 87.6 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 125 | MoViNet-A0 | 87.4 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 126 | R[2+1]D-Flow | 87.2 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |