| 1 | DejaVid | 96.3 | Yes | - | - | Code |
| 2 | VideoMAE V2-g | 95.9 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 3 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | 95.7 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 4 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | 95.5 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 5 | TubeViT-L | 95.2 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 6 | VideoMAE (no extra data, ViT-L, 32x2) | 95.2 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 7 | MaskFeat (Kinetics600 pretrain, MViT-L) | 95 | Yes | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 8 | MAR (50% mask, ViT-L, 16x4) | 94.9 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 9 | VideoMAE (no extra data, ViT-L, 16frame) | 94.6 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 10 | UniFormerV2-L | 94.5 | Yes | - | - | Code |
| 11 | ATM | 94.4 | No | What Can Simple Arithmetic Operations Do for Tem... | 2023-07-18 | Code |
| 12 | MAR (75% mask, ViT-L, 16x4) | 94.4 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 13 | MViTv2-L (IN-21K + Kinetics400 pretrain) | 94.1 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 14 | Side4Video (EVA ViT-E/14) | 94 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 15 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | 94 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 16 | AMD(ViT-B/16) | 94 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 17 | ST-Adapter (ViT-L, CLIP) | 93.9 | Yes | ST-Adapter: Parameter-Efficient Image-to-Video T... | 2022-06-27 | Code |
| 18 | TDS-CLIP-ViT-L/14(8frames) | 93.8 | No | TDS-CLIP: Temporal Difference Side Network for I... | 2024-08-20 | Code |
| 19 | OMNIVORE (Swin-B, IN-21K + Kinetics400 pretrain) | 93.5 | Yes | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code |
| 20 | MViTv2-B (IN-21K + Kinetics400 pretrain) | 93.4 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 21 | ZeroI2V ViT-L/14 | 93 | Yes | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra... | 2023-10-02 | Code |
| 22 | UniFormer-B (IN-1K + Kinetics400 pretrain) | 92.8 | Yes | - | - | Code |
| 23 | MAR (50% mask, ViT-B, 16x4) | 92.8 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 24 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | 92.8 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 25 | MorphMLP-B (IN-1K) | 92.8 | Yes | MorphMLP: An Efficient MLP-Like Backbone for Spa... | 2021-11-24 | Code |
| 26 | Swin-B (IN-21K + Kinetics400 pretrain) | 92.7 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 27 | MML (ensemble) | 92.7 | Yes | Mutual Modality Learning for Video Action Classi... | 2020-11-04 | Code |
| 28 | CoVeR(JFT-3B) | 92.5 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 29 | AMD(ViT-S/16) | 92.5 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 30 | VideoMAE (no extra data, ViT-B, 16frame) | 92.4 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 31 | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 92.2 | Yes | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code |
| 32 | UniFormer-S (IN-1K + Kinetics600 pretrain) | 92.1 | Yes | - | - | Code |
| 33 | CoVeR(JFT-300M) | 91.9 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 34 | MAR (75% mask, ViT-B, 16x4) | 91.9 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 35 | ILA (ViT-L/14) | 91.8 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 36 | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 91.6 | Yes | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code |
| 37 | ORViT Mformer-L (ORViT blocks) | 91.5 | Yes | Object-Region Video Transformers | 2021-10-13 | Code |
| 38 | MViT-B-24, 32x3 | 91.5 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code |
| 39 | TRG (Inception-V3) | 91.4 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | - |
| 40 | MViT-B, 32x3(Kinetics600 pretrain) | 91.3 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code |
| 41 | MML (single) | 91.3 | Yes | Mutual Modality Learning for Video Action Classi... | 2020-11-04 | Code |
| 42 | TSM (RGB + Flow) | 91.3 | Yes | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code |
| 43 | Mformer-L | 91.2 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code |
| 44 | GC-TDN Ensemble (R50,8+16) | 91.2 | Yes | Group Contextualization for Video Recognition | 2022-03-18 | Code |
| 45 | CT-Net Ensemble (R50, 8+12+16+24) | 91.1 | Yes | CT-Net: Channel Tensorization Network for Video ... | 2021-06-03 | Code |
| 46 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 91.1 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 47 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 91.1 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 48 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 91.1 | No | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 49 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 91 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 50 | PLAR | 91 | No | SCP: Soft Conditional Prompt Learning for Aerial... | 2023-05-21 | - |
| 51 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 90.8 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 52 | X-ViT (x16) | 90.8 | Yes | Space-time Mixing Attention for Video Transformer | 2021-06-10 | Code |
| 53 | Mformer-HR | 90.6 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code |
| 54 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 90.6 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 55 | PAN ResNet101 (RGB only, no Flow) | 90.6 | Yes | PAN: Towards Fast Action Recognition via Learnin... | 2020-08-08 | Code |
| 56 | ORViT Mformer (ORViT blocks) | 90.5 | Yes | Object-Region Video Transformers | 2021-10-13 | Code |
| 57 | VoV3D-L (32frames, Kinetics pretrained, single) | 90.5 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 58 | MTV-B | 90.4 | Yes | Multiview Transformers for Video Recognition | 2022-01-12 | Code |
| 59 | TAdaConvNeXt-T | 90.4 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 60 | TSM+W3 (16 frames, RGB ResNet-50) | 90.4 | Yes | Knowing What, Where and When to Look: Efficient ... | 2020-04-02 | - |
| 61 | ILA (ViT-B/16) | 90.3 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 62 | TRG (ResNet-50) | 90.3 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | - |
| 63 | MViT-B, 16x4 | 90.2 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code |
| 64 | Mformer | 90.1 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code |
| 65 | TAda2D-En (ResNet-50, 8+16 frames) | 89.8 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 66 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 89.8 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 67 | E3D-L | 89.8 | No | Maximizing Spatio-Temporal Entropy of Deep 3D CN... | 2023-03-05 | Code |
| 68 | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 89.8 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 69 | ViViT-L/16x2 Fact. encoder | 89.8 | Yes | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 70 | STM (16 frames, ImageNet pretraining) | 89.8 | No | STM: SpatioTemporal and Motion Encoding for Acti... | 2019-08-07 | - |
| 71 | VoV3D-L (32frames, from scratch, single) | 89.5 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 72 | VoV3D-M (32frames, Kinetics pretrained, single) | 89.48 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 73 | MSNet-R50 (16 frames, ImageNet pretrained) | 89.4 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 74 | CCS + two-stream + TRN | 89.3 | No | Cooperative Cross-Stream Network for Discriminat... | 2019-08-27 | - |
| 75 | TAda2D (ResNet-50, 16 frames) | 89.2 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 76 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 89.1 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 77 | MoViNet-A2 | 89 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 78 | MoViNet-A1 | 89 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 79 | VoV3D-M (32frames, from scratch, single) | 88.8 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 80 | VoV3D-L (16frames, from scratch, single) | 88.6 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 81 | MSNet-R50 (8 frames, ImageNet pretrained) | 88.4 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 82 | VoV3D-M (16frames, from scratch, single) | 88.2 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 83 | MoViNet-A0 | 88.2 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 84 | TAda2D (ResNet-50, 8 frames) | 88 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 85 | DirecFormer | 87.9 | No | DirecFormer: A Directed Attention in Transformer... | 2022-03-19 | Code |
| 86 | OmniVL | 86.2 | No | OmniVL: One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 87 | CPNet Res34, 5 CP | 83.95 | No | Learning Video Representations from Corresponden... | 2019-05-20 | Code |
| 88 | 2-Stream TRN | 83.06 | No | Temporal Relational Reasoning in Videos | 2017-11-22 | Code |
| 89 | model3D_1 with left-right augmentation and fps jitter | 80.46 | No | The "something something" video database for lea... | 2017-06-13 | Code |
| 90 | Prob-Distill | 79.1 | No | Attention Distillation for Learning Video Repres... | 2019-04-05 | - |
| 91 | InternVideo2-6B | 12 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |