| 1 | OmniVec2 | 93.6 | No | - | - | - |
| 2 | FTP-UniFormerV2-L/14 | 93.4 | No | Enhancing Video Transformers for Action Understa... | 2024-03-24 | - |
| 3 | InternVideo2-6B | 92.1 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 4 | InternVideo2-1B | 91.6 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 5 | InternVideo | 91.1 | No | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 6 | OmniVec | 91.1 | No | OmniVec: Learning robust representations with cr... | 2023-11-07 | - |
| 7 | TubeViT-H (ImageNet-1k) | 90.9 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 8 | Unmasked Teacher (ViT-L) | 90.6 | No | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 9 | UMT-L (ViT-L/16) | 90.6 | No | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 10 | TubeViT-L (ImageNet-1k) | 90.2 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 11 | UniFormerV2-L (ViT-L, 336) | 90 | Yes | - | - | Code |
| 12 | VideoMAE V2-g (64x266x266) | 90 | No | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 13 | FluxViT-B | 90 | Yes | Make Your Training Flexible: Towards Deployment-... | 2025-03-18 | Code |
| 14 | MTV-H (WTS 60M) | 89.9 | No | Multiview Transformers for Video Recognition | 2022-01-12 | Code |
| 15 | TAdaFormer-L/14 | 89.9 | No | Temporally-Adaptive Models for Efficient Video U... | 2023-08-10 | Code |
| 16 | EVA | 89.7 | No | EVA: Exploring the Limits of Masked Visual Repre... | 2022-11-14 | Code |
| 17 | AM/12 ViT-B Dinov2 | 89.6 | No | AM Flow: Adapters for Temporal Processing in Act... | 2024-11-04 | - |
| 18 | ATM | 89.4 | No | What Can Simple Arithmetic Operations Do for Tem... | 2023-07-18 | Code |
| 19 | DejaVid | 89.1 | Yes | - | - | Code |
| 20 | CoCa (finetuned) | 88.9 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 21 | BIKE (CLIP ViT-L/14) | 88.7 | No | Bidirectional Cross-Modal Knowledge Exploration ... | 2022-12-31 | Code |
| 22 | ILA (ViT-L/14) | 88.7 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 23 | Side4Video (EVA, ViT-E/14) | 88.6 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 24 | TubeViT-B (ImageNet-1k) | 88.6 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 25 | VideoMAE V2-g | 88.5 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 26 | ONE-PEACE | 88.1 | No | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 27 | FluxViT-S | 88 | Yes | Make Your Training Flexible: Towards Deployment-... | 2025-03-18 | Code |
| 28 | CoCa (frozen) | 88 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 29 | ViT-22B | 88 | No | Scaling Vision Transformers to 22 Billion Parame... | 2023-02-10 | Code |
| 30 | Text4Vis (CLIP ViT-L/14) | 87.8 | No | Revisiting Classifier: Transferring Vision-Langu... | 2022-07-04 | Code |
| 31 | Hiera-H (no extra data) | 87.8 | No | Hiera: A Hierarchical Vision Transformer without... | 2023-06-01 | Code |
| 32 | EVL (CLIP ViT-L/14@336px, frozen, 32 frames) | 87.7 | No | Frozen CLIP Models are Efficient Video Learners | 2022-08-06 | Code |
| 33 | DualPath w/ ViT-L/14 | 87.7 | No | Dual-path Adaptation from Image to Video Transfo... | 2023-03-17 | Code |
| 34 | X-CLIP (ViT-L/14, CLIP) | 87.7 | No | Expanding Language-Image Pretrained Models for G... | 2022-08-04 | Code |
| 35 | AIM (CLIP ViT-L/14, 32x224) | 87.5 | Yes | AIM: Adapting Image Models for Efficient Video A... | 2023-02-06 | Code |
| 36 | VideoMAE (no extra data, ViT-H, 32x320x320) | 87.4 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 37 | ST-Adapter (ViT-L, CLIP) | 87.2 | No | ST-Adapter: Parameter-Efficient Image-to-Video T... | 2022-06-27 | Code |
| 38 | ZeroI2V ViT-L/14 | 87.2 | No | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra... | 2023-10-02 | Code |
| 39 | CoVeR (JFT-3B) | 87.2 | No | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 40 | MVD (K400 pretrain, ViT-H, 16x224x224) | 87.2 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 41 | mPLUG-2 | 87.1 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 42 | MaskFeat (K600, MViT-L) | 87 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 43 | VicTR (ViT-L/14) | 87 | No | VicTR: Video-conditioned Text Representations fo... | 2023-04-05 | - |
| 44 | Video-SwinV2-G (ImageNet-22k and external 70M pretrain) | 86.8 | No | Swin Transformer V2: Scaling Up Capacity and Res... | 2021-11-18 | Code |
| 45 | MaskFeat (no extra data, MViT-L) | 86.7 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 46 | VideoMAE (no extra data, ViT-H) | 86.6 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 47 | MVD (K400 pretrain, ViT-L, 16x224x224) | 86.4 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 48 | TAdaConvNeXtV2-B | 86.4 | No | Temporally-Adaptive Models for Efficient Video U... | 2023-08-10 | Code |
| 49 | CoVeR (JFT-300M) | 86.3 | No | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 50 | VideoMAE (no extra data, ViT-L, 32x320x320) | 86.1 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 51 | MViTv2-L (ImageNet-21k pretrain) | 86.1 | Yes | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 52 | ILA (ViT-B/16) | 85.7 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 53 | DualPath w/ ViT-B/16 | 85.4 | No | Dual-path Adaptation from Image to Video Transfo... | 2023-03-17 | Code |
| 54 | TokenLearner 16at18 (L/10) | 85.4 | No | TokenLearner: What Can 8 Learned Tokens Do for I... | 2021-06-21 | Code |
| 55 | MAR (50% mask, ViT-L, 16x4) | 85.3 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 56 | CAST (ViT-B/16) | 85.3 | No | CAST: Cross-Attention in Space and Time for Vide... | 2023-11-30 | Code |
| 57 | VideoMAE (no extra data, ViT-L, 16x4) | 85.2 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 58 | ViC-MAE (ViT-L) | 85.1 | No | ViC-MAE: Self-Supervised Representation Learning... | 2023-03-21 | Code |
| 59 | VideoMamba-M800 | 85 | No | VideoMamba: State Space Model for Efficient Vide... | 2024-03-11 | Code |
| 60 | Swin-L (384x384, ImageNet-21k pretrain) | 84.9 | No | Video Swin Transformer | 2021-06-24 | Code |
| 61 | ViViT-H/16x2 (JFT) | 84.9 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 62 | OMNIVORE (Swin-L) | 84.1 | No | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code |
| 63 | OMNIVORE (Swin-B) | 84 | No | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code |
| 64 | MAR (75% mask, ViT-L, 16x4) | 83.9 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 65 | ActionCLIP (CLIP-pretrained) | 83.8 | No | ActionCLIP: A New Paradigm for Video Action Reco... | 2021-09-17 | Code |
| 66 | OmniSource irCSN-152 (IG-Kinetics-65M pretrain) | 83.6 | No | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 67 | MVD (K400 pretrain, ViT-B, 16x224x224) | 83.4 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 68 | StructViT-B-4-1 | 83.4 | No | Learning Correlation Structures for Vision Trans... | 2024-04-05 | - |
| 69 | Swin-L (ImageNet-21k pretrain) | 83.1 | No | Video Swin Transformer | 2021-06-24 | Code |
| 70 | SIFA | 83.1 | No | Stand-Alone Inter-Frame Attention in Video Models | 2022-06-14 | Code |
| 71 | UniFormer-B (ImageNet-1K) | 82.9 | No | - | - | Code |
| 72 | irCSN-152 (IG-Kinetics-65M pretrain) | 82.8 | No | Large-scale weakly-supervised pre-training for v... | 2019-05-02 | Code |
| 73 | DirecFormer | 82.75 | No | DirecFormer: A Directed Attention in Transformer... | 2022-03-19 | Code |
| 74 | Swin-B (ImageNet-21k pretrain) | 82.7 | No | Video Swin Transformer | 2021-06-24 | Code |
| 75 | ir-CSN-152 (IG-65M pretraining) | 82.6 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 76 | ip-CSN-152 (IG-65M pretraining) | 82.5 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 77 | TPS | 82.5 | No | Spatiotemporal Self-attention Modeling with Temp... | 2022-07-27 | Code |
| 78 | ILA (ViT-B/32) | 82.4 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 79 | AMD (ViT-B/16) | 82.2 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 80 | VATT-Large | 82.1 | No | VATT: Transformers for Multimodal Self-Supervise... | 2021-04-22 | Code |
| 81 | AdaMAE | 81.7 | No | AdaMAE: Adaptive Masking for Efficient Spatiotem... | 2022-11-16 | Code |
| 82 | VideoMAE (no extra data, ViT-B, 16x4) | 81.5 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 83 | MoViNet-A6 | 81.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 84 | MLP-3D | 81.4 | No | MLP-3D: A MLP-like 3D Architecture with Grouped ... | 2022-06-13 | - |
| 85 | R[2+1]D-152 (IG-65M pretraining) | 81.3 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 86 | LGD-3D Two-stream (ResNet-101) | 81.2 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 87 | MViT-B, 64x3 | 81.2 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 88 | Motionformer-HR | 81.1 | No | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code |
| 89 | MVD (K400 pretrain, ViT-S, 16x224x224) | 81 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 90 | MAR (50% mask, ViT-B, 16x4) | 81 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 91 | MoViNet-A5 | 80.9 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 92 | MBT (AV) | 80.8 | No | Attention Bottlenecks for Multimodal Fusion | 2021-06-30 | Code |
| 93 | TimeSformer-L | 80.7 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 94 | Swin-B (ImageNet-1k pretrain) | 80.6 | No | Video Swin Transformer | 2021-06-24 | Code |
| 95 | Swin-S (ImageNet-1k pretrain) | 80.6 | No | Video Swin Transformer | 2021-06-24 | Code |
| 96 | En-VidTr-L | 80.5 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 97 | MoViNet-A4 | 80.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 98 | OmniSource SlowOnly R101 8x8 (ImageNet pretrain) | 80.5 | No | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 99 | STAM (64 Frames) | 80.5 | No | An Image is Worth 16x16 Words, What is a Video W... | 2021-03-25 | Code |
| 100 | X3D-XXL | 80.4 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 101 | R3D-RS-200 | 80.4 | No | Revisiting 3D ResNets for Video Recognition | 2021-09-03 | Code |
| 102 | OmniSource SlowOnly R101 8x8 (Scratch) | 80.4 | No | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 103 | MViT-B, 32x3 | 80.2 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 104 | AMD (ViT-S/16) | 80.1 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 105 | SlowFast 16x8 (ResNet-101 + NL) | 79.8 | Yes | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 106 | CT-Net Ensemble | 79.8 | No | CT-Net: Channel Tensorization Network for Video ... | 2021-06-03 | Code |
| 107 | ViT-B-VTN+ ImageNet-21K (84.0 [10]) | 79.8 | No | Video Transformer Network | 2021-02-01 | Code |
| 108 | TimeSformer-HR | 79.7 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 109 | En-VidTr-M | 79.7 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 110 | LGD-3D RGB (ResNet-101) | 79.4 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 111 | TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 79.4 | No | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code |
| 112 | En-VidTr-S | 79.4 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 113 | MAR (75% mask, ViT-B, 16x4) | 79.4 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 114 | STAM (16 Frames) | 79.3 | No | An Image is Worth 16x16 Words, What is a Video W... | 2021-03-25 | Code |
| 115 | ip-CSN-152 (Sports-1M pretraining) | 79.2 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 116 | CorrNet | 79.2 | No | Video Modeling with Correlation Networks | 2019-06-07 | - |
| 117 | OmniVL | 79.1 | No | OmniVL: One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 118 | X3D-XL | 79.1 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 119 | MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 79.1 | No | MVFNet: Multi-View Fusion Network for Efficient ... | 2020-12-13 | Code |
| 120 | TAdaConvNeXt-T | 79.1 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 121 | SlowFast 16x8 (ResNet-101) | 78.9 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 122 | G-Blend (Sports-1M pretrain) | 78.9 | No | What Makes Training Multi-Modal Classification N... | 2019-05-29 | Code |
| 123 | Swin-T (ImageNet-1k pretrain) | 78.8 | No | Video Swin Transformer | 2021-06-24 | Code |
| 124 | GB + DF + LB (ResNet 152, ImageNet pretrained) | 78.8 | No | Action recognition with spatial-temporal discrim... | 2019-08-20 | - |
| 125 | ViT-B-VTN (3 layers, ImageNet pretrain) | 78.6 | No | Video Transformer Network | 2021-02-01 | Code |
| 126 | MViT-B, 16x4 | 78.4 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 127 | MoViNet-A3 | 78.2 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 128 | TAda2D-En (ResNet-50, 8+16 frames) | 78.2 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 129 | SVT | 78.1 | No | Self-supervised Video Transformer | 2021-12-02 | Code |
| 130 | TimeSformer | 78 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 131 | SlowFast 8x8 (ResNet-101) | 77.9 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 132 | RepFlow-50 ([2+1]D CNN, FcF, Non-local block) | 77.9 | No | Representation Flow for Action Recognition | 2018-10-02 | Code |
| 133 | ip-CSN-152 | 77.8 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 134 | I3D + NL | 77.7 | No | Non-local Neural Networks | 2017-11-21 | Code |
| 135 | G-Blend | 77.7 | No | What Makes Training Multi-Modal Classification N... | 2019-05-29 | Code |
| 136 | HATNet (32 frames) | 77.6 | No | Large Scale Holistic Video Understanding | 2019-04-25 | Code |
| 137 | X3D-L | 77.5 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 138 | CoST ResNet-101 (ImageNet pretrain) | 77.5 | No | - | - | Code |
| 139 | TAda2D (ResNet-50, 16 frames) | 77.4 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 140 | EvaNet | 77.4 | No | Evolving Space-Time Neural Architectures for Vid... | 2018-11-26 | - |
| 141 | RNL+TSM Ensemble (ResNet50, 8 + 16 frames) | 77.4 | No | Region-based Non-local Operation for Video Class... | 2020-07-17 | Code |
| 142 | VIMPAC | 77.4 | No | VIMPAC: Video Pre-Training via Masked Token Pred... | 2021-06-21 | Code |
| 143 | BQN (ResNet-50) | 77.3 | No | Busy-Quiet Video Disentangling for Video Classif... | 2021-03-29 | Code |
| 144 | S3D-G (RGB+Flow, ImageNet pretrained) | 77.2 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 145 | SlowFast 8x8 (ResNet-50) | 77 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 146 | TAda2D (ResNet-50, 8 frames) | 76.7 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 147 | D3D+S3D-G (RGB + RGB) | 76.5 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 148 | MSNet-R50 (16 frames, ImageNet pretrained) | 76.4 | No | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 149 | GloRe | 76.1 | No | Global Textual Relation Embedding for Relational... | 2019-06-03 | Code |
| 150 | X3D-M | 76 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 151 | MViT-S | 76 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 152 | CMA iter1 (16 frames) | 75.98 | No | Two-Stream Video Classification with Cross-Modal... | 2019-08-01 | - |
| 153 | D3D (RGB) | 75.9 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 154 | Oct-I3D + NL | 75.7 | No | Drop an Octave: Reducing Spatial Redundancy in C... | 2019-04-10 | Code |
| 155 | SlowFast 4x16 (ResNet-50) | 75.6 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 156 | R[2+1]D-Flow (Sports-1M pretrain) | 75.4 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 157 | FASTER32 | 75.1 | No | FASTER Recurrent Networks for Efficient Video Cl... | 2019-06-10 | - |
| 158 | MoViNet-A2 | 75 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 159 | MARS+RGB+Flow (64 frames) | 74.9 | No | - | - | Code |
| 160 | S3D-G (RGB, ImageNet pretrained) | 74.7 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 161 | TSM | 74.7 | No | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code |
| 162 | A2 Net | 74.6 | No | $A^2$-Nets: Double Attention Networks | 2018-10-27 | - |
| 163 | R[2+1]D-RGB (Sports-1M pretrain) | 74.3 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 164 | TSN | 73.9 | No | ConvNet Architecture Search for Spatiotemporal F... | 2017-08-16 | Code |
| 165 | R[2+1]D-Two-Stream | 73.9 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 166 | TSN | 73.9 | No | ConvNet Architecture Search for Spatiotemporal F... | 2017-08-16 | Code |
| 167 | STM (ResNet-50) | 73.7 | No | STM: SpatioTemporal and Motion Encoding for Acti... | 2019-08-07 | - |
| 168 | bLVNet Fan et al. (2019) | 73.5 | No | More Is Less: Learning Efficient Video Represent... | 2019-12-02 | Code |
| 169 | Co Slow_64 | 73.05 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 170 | Inception-ResNet | 73 | No | Revisiting the Effectiveness of Off-the-shelf Te... | 2017-08-12 | - |
| 171 | MFNet | 72.8 | No | Multi-Fiber Networks for Video Recognition | 2018-07-30 | - |
| 172 | MoViNet-A1 | 72.7 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 173 | ARTNet | 72.4 | No | Appearance-and-Relation Networks for Video Class... | 2017-11-24 | Code |
| 174 | LGD-3D Flow (ResNet-101) | 72.3 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 175 | R[2+1]D | 72 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 176 | R[2+1]D-RGB | 72 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 177 | FASTER16 w/o sp | 71.7 | No | FASTER Recurrent Networks for Efficient Video Cl... | 2019-06-10 | - |
| 178 | Co X3D-L_64 | 71.61 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 179 | I3D | 71.1 | No | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 180 | Co X3D-M_64 | 71.03 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 181 | X3D-L | 69.29 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 182 | MARS+RGB+Flow (16 frames) | 68.9 | No | - | - | Code |
| 183 | SlowFast-8×8-R50 | 68.45 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 184 | S3D-G (Flow, ImageNet pretrained) | 68 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 185 | R[2+1]D-Flow | 67.5 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 186 | Slow-8x8-R50 | 67.42 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 187 | Co X3D-S_64 | 67.33 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 188 | X3D-M | 67.24 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 189 | SlowFast-4×16-R50 | 67.06 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 190 | Co Slow_8 | 65.9 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 191 | MoViNet-A0 | 65.8 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 192 | X3D-S | 64.71 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 193 | I3D-R50 | 63.98 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 194 | Co X3D-L_16 | 63.03 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 195 | Co X3D-M_16 | 62.8 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 196 | Co X3D-S_13 | 60.18 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 197 | Co I3D_8 | 59.58 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 198 | R(2+1)D-18_16 | 59.52 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 199 | X3D-XS | 59.37 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 200 | Co I3D_64 | 56.86 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 201 | R(2+1)D-18_8 | 53.52 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 202 | RCU_8 | 53.4 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |