| 1 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | 77.3 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 2 | DejaVid | 77.2 | Yes | - | - | Code |
| 3 | InternVideo | 77.2 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 4 | InternVideo2-1B | 77.1 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 5 | VideoMAE V2-g | 77 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 6 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | 76.7 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 7 | Hiera-L (no extra data) | 76.5 | No | Hiera: A Hierarchical Vision Transformer without... | 2023-06-01 | Code |
| 8 | TubeViT-L | 76.1 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 9 | VideoMAE (no extra data, ViT-L, 32x2) | 75.4 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 10 | Side4Video (EVA ViT-E/14) | 75.2 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 11 | MaskFeat (Kinetics600 pretrain, MViT-L) | 75 | Yes | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 12 | MAR (50% mask, ViT-L, 16x4) | 74.7 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 13 | ATM | 74.6 | No | What Can Simple Arithmetic Operations Do for Tem... | 2023-07-18 | Code |
| 14 | MAWS (ViT-L) | 74.4 | Yes | The effectiveness of MAE pre-pretraining for bil... | 2023-03-23 | Code |
| 15 | VideoMAE (no extra data, ViT-L, 16frame) | 74.3 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 16 | MAR (75% mask, ViT-L, 16x4) | 73.8 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 17 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | 73.7 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 18 | ViC-MAE (ViT-L) | 73.7 | No | ViC-MAE: Self-Supervised Representation Learning... | 2023-03-21 | Code |
| 19 | TAdaFormer-L/14 | 73.6 | Yes | Temporally-Adaptive Models for Efficient Video U... | 2023-08-10 | Code |
| 20 | TDS-CLIP-ViT-L/14(8frames) | 73.4 | No | TDS-CLIP: Temporal Difference Side Network for I... | 2024-08-20 | Code |
| 21 | MViTv2-L (IN-21K + Kinetics400 pretrain) | 73.3 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 22 | AMD(ViT-B/16) | 73.3 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 23 | UniFormerV2-L | 73 | Yes | - | - | Code |
| 24 | ST-Adapter (ViT-L, CLIP) | 72.3 | Yes | ST-Adapter: Parameter-Efficient Image-to-Video T... | 2022-06-27 | Code |
| 25 | ZeroI2V ViT-L/14 | 72.2 | Yes | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra... | 2023-10-02 | Code |
| 26 | MViT-B (IN-21K + Kinetics400 pretrain) | 72.1 | Yes | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 27 | CAST(ViT-B/16) | 71.6 | No | CAST: Cross-Attention in Space and Time for Vide... | 2023-11-30 | Code |
| 28 | StructVit-B-4-1 | 71.5 | No | Learning Correlation Structures for Vision Trans... | 2024-04-05 | - |
| 29 | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) | 71.4 | Yes | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code |
| 30 | BEVT (IN-1K + Kinetics400 pretrain) | 71.4 | Yes | BEVT: BERT Pretraining of Video Transformers | 2021-12-02 | Code |
| 31 | UniFormer-B (IN-1K + Kinetics400 pretrain) | 71.2 | Yes | - | - | Code |
| 32 | TAdaConvNeXtV2-B | 71.1 | Yes | Temporally-Adaptive Models for Efficient Video U... | 2023-08-10 | Code |
| 33 | MAR (50% mask, ViT-B, 16x4) | 71 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 34 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | 70.9 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 35 | CoVeR(JFT-3B) | 70.9 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 36 | VideoMAE (no extra data, ViT-B, 16frame) | 70.8 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 37 | AMD(ViT-S/16) | 70.2 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 38 | ILA (ViT-L/14) | 70.2 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 39 | MorphMLP-B (IN-1K) | 70.1 | Yes | MorphMLP: An Efficient MLP-Like Backbone for Spa... | 2021-11-24 | Code |
| 40 | CoVeR(JFT-300M) | 69.8 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 41 | TPS | 69.8 | No | Spatiotemporal Self-attention Modeling with Temp... | 2022-07-27 | Code |
| 42 | SIFA | 69.8 | No | Stand-Alone Inter-Frame Attention in Video Models | 2022-06-14 | Code |
| 43 | Swin-B (IN-21K + Kinetics400 pretrain) | 69.6 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 44 | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 69.6 | Yes | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code |
| 45 | MAR (75% mask, ViT-B, 16x4) | 69.5 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 46 | ORViT Mformer-L (ORViT blocks) | 69.5 | Yes | Object-Region Video Transformers | 2021-10-13 | Code |
| 47 | UniFormer-S (IN-1K + Kinetics600 pretrain) | 69.4 | Yes | - | - | Code |
| 48 | MML (ensemble) | 69.02 | Yes | Mutual Modality Learning for Video Action Classi... | 2020-11-04 | Code |
| 49 | MViT-B-24, 32x3 | 68.7 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code |
| 50 | MTV-B | 68.5 | Yes | Multiview Transformers for Video Recognition | 2022-01-12 | Code |
| 51 | MLP-3D | 68.5 | No | MLP-3D: A MLP-like 3D Architecture with Grouped ... | 2022-06-13 | - |
| 52 | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 68.2 | Yes | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code |
| 53 | MSMA (8+16frames) | 68.2 | No | - | - | - |
| 54 | Mformer-L | 68.1 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code |
| 55 | VIMPAC | 68.1 | Yes | VIMPAC: Video Pre-Training via Masked Token Pred... | 2021-06-21 | Code |
| 56 | ORViT Mformer (ORViT blocks) | 67.9 | Yes | Object-Region Video Transformers | 2021-10-13 | Code |
| 57 | MViT-B, 32x3(Kinetics600 pretrain) | 67.8 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code |
| 58 | GC-TDN Ensemble (R50,8+16) | 67.8 | Yes | Group Contextualization for Video Recognition | 2022-03-18 | Code |
| 59 | CT-Net Ensemble (R50, 8+12+16+24) | 67.8 | Yes | CT-Net: Channel Tensorization Network for Video ... | 2021-06-03 | Code |
| 60 | TCM (Ensemble) | 67.8 | No | Motion-driven Visual Tempo Learning for Video-ba... | 2022-02-24 | Code |
| 61 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 67.7 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 62 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 67.7 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 63 | GTDNet | 67.6 | No | - | - | - |
| 64 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 67.4 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 65 | VoV3D-L (32frames, Kinetics pretrained, single) | 67.35 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 66 | PLAR | 67.3 | No | SCP: Soft Conditional Prompt Learning for Aerial... | 2023-05-21 | - |
| 67 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 67.3 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 68 | X-Vit (x16) | 67.2 | Yes | Space-time Mixing Attention for Video Transformer | 2021-06-10 | Code |
| 69 | TAda2D-En (ResNet-50, 8+16 frames) | 67.2 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 70 | Mformer-HR | 67.1 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code |
| 71 | TAdaConvNeXt-T | 67.1 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 72 | MoDS (8+16frames) | 67.1 | No | - | - | - |
| 73 | STPG (8+16frames) | 67 | No | - | - | - |
| 74 | MML (single) | 66.83 | Yes | Mutual Modality Learning for Video Action Classi... | 2020-11-04 | Code |
| 75 | ILA (ViT-B/16) | 66.8 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 76 | TSM (RGB + Flow) | 66.6 | Yes | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code |
| 77 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 66.6 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 78 | PAN ResNet101 (RGB only, no Flow) | 66.5 | Yes | PAN: Towards Fast Action Recognition via Learnin... | 2020-08-08 | Code |
| 79 | TSM+W3 (16 frames, RGB ResNet-50) | 66.5 | Yes | Knowing What, Where and When to Look: Efficient ... | 2020-04-02 | - |
| 80 | Mformer | 66.5 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code |
| 81 | MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 66.3 | Yes | MVFNet: Multi-View Fusion Network for Efficient ... | 2020-12-13 | Code |
| 82 | MViT-B, 16x4 | 66.2 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code |
| 83 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 66 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 84 | VoV3D-L (32frames, from scratch, single) | 65.8 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 85 | E3D-L | 65.7 | No | Maximizing Spatio-Temporal Entropy of Deep 3D CN... | 2023-03-05 | Code |
| 86 | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 65.7 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 87 | TAda2D (ResNet-50, 16 frames) | 65.6 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 88 | ViViT-L/16x2 Fact. encoder | 65.4 | Yes | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 89 | VoV3D-M (32frames, Kinetics pretrained, single) | 65.24 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 90 | bLVNet | 65.2 | Yes | More Is Less: Learning Efficient Video Represent... | 2019-12-02 | Code |
| 91 | DirecFormer | 64.94 | No | DirecFormer: A Directed Attention in Transformer... | 2022-03-19 | Code |
| 92 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 64.8 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 93 | MSNet-R50 (16 frames, ImageNet pretrained) | 64.7 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 94 | AK-Net | 64.3 | No | Action Keypoint Network for Efficient Video Reco... | 2022-01-17 | - |
| 95 | VoV3D-M (32frames, from scratch, single) | 64.2 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 96 | STM (16 frames, ImageNet pretraining) | 64.2 | No | STM: SpatioTemporal and Motion Encoding for Acti... | 2019-08-07 | - |
| 97 | VoV3D-L (16frames, from scratch, single) | 64.1 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 98 | TAda2D (ResNet-50, 8 frames) | 64 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 99 | MoViNet-A2 | 63.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 100 | VoV3D-M (16frames, from scratch, single) | 63.2 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 101 | MSNet-R50 (8 frames, ImageNet pretrained) | 63 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 102 | MoViNet-A1 | 62.7 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 103 | OmniVL | 62.5 | No | OmniVL: One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 104 | TimeSformer-HR | 62.5 | Yes | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 105 | TimeSformer-L | 62.3 | Yes | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 106 | TRG (ResNet-50) | 62.2 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | - |
| 107 | TPN (TSM-50) | 62 | No | Temporal Pyramid Network for Action Recognition | 2020-04-07 | Code |
| 108 | Multigrid | 61.7 | Yes | A Multigrid Method for Efficiently Training Vide... | 2019-12-02 | Code |
| 109 | SlowFast | 61.7 | Yes | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 110 | TRG (Inception-V3) | 61.3 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | - |
| 111 | MoViNet-A0 | 61.3 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 112 | CCS + two-stream + TRN | 61.2 | No | Cooperative Cross-Stream Network for Discriminat... | 2019-08-27 | - |
| 113 | VidTr-L | 60.2 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 114 | TimeSformer | 59.5 | Yes | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 115 | SVT | 59.2 | No | Self-supervised Video Transformer | 2021-12-02 | Code |
| 116 | CPNet Res34, 5 CP | 57.65 | No | Learning Video Representations from Corresponden... | 2019-05-20 | Code |
| 117 | 2-Stream TRN | 55.52 | No | Temporal Relational Reasoning in Videos | 2017-11-22 | Code |
| 118 | TAM (5-shot) | 52.3 | No | Few-Shot Video Classification via Temporal Align... | 2019-06-27 | - |
| 119 | model3D_1 with left-right augmentation and fps jitter | 51.33 | No | The "something something" video database for lea... | 2017-06-13 | Code |
| 120 | Prob-Distill | 49.9 | No | Attention Distillation for Learning Video Repres... | 2019-04-05 | - |
| 121 | STM + TRNMultiscale | 47.73 | No | Comparative Analysis of CNN-based Spatiotemporal... | 2019-09-11 | Code |
| 122 | DIN | 34.11 | No | DenseImage Network: Video Spatial-Temporal Evolu... | 2018-05-19 | - |
| 123 | InternVideo2-6B | 1 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |