| 1 | FTP-UniFormerV2-L/14 | 99.7 | No | Enhancing Video Transformers for Action Understa... | 2024-03-24 | - |
| 2 | VideoMAE V2-g | 99.6 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 3 | OmniVec | 99.6 | Yes | OmniVec: Learning robust representations with cr... | 2023-11-07 | - |
| 4 | OmniVec2 | 99.6 | Yes | - | - | - |
| 5 | VideoMAE V2-g | 99.6 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 6 | BIKE | 98.8 | Yes | Bidirectional Cross-Modal Knowledge Exploration ... | 2022-12-31 | Code |
| 7 | SMART | 98.64 | No | SMART Frame Selection for Action Recognition | 2020-12-19 | - |
| 8 | OmniSource (SlowOnly-8x8-R101-RGB + I3D-Flow) | 98.6 | Yes | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 9 | PERF-Net (multi-distilled S3D) | 98.6 | No | PERF-Net: Pose Empowered RGB-Flow Net | 2020-09-28 | - |
| 10 | ZeroI2V ViT-L/14 | 98.6 | Yes | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra... | 2023-10-02 | Code |
| 11 | LGD-3D Two-stream | 98.2 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 12 | Text4Vis | 98.2 | No | Revisiting Classifier: Transferring Vision-Langu... | 2022-07-04 | Code |
| 13 | Two-Stream I3D (ImageNet+Kinetics pre-training) | 98 | No | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 14 | MARS+RGB+Flow (64 frames, Kinetics pretrained) | 97.8 | Yes | - | - | Code |
| 15 | HATNet (32 frames) | 97.8 | No | Large Scale Holistic Video Understanding | 2019-04-25 | Code |
| 16 | Two-Stream I3D (Kinetics pre-training) | 97.8 | No | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 17 | BubbleNET | 97.62 | Yes | - | - | - |
| 18 | D3D + D3D | 97.6 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 19 | BQN | 97.6 | No | Busy-Quiet Video Disentangling for Video Classif... | 2021-03-29 | Code |
| 20 | MVD (ViT-B) | 97.5 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 21 | CCS + TSN (ImageNet+Kinetics pretrained) | 97.4 | Yes | Cooperative Cross-Stream Network for Discriminat... | 2019-08-27 | - |
| 22 | R[2+1]D-TwoStream (Kinetics pretrained) | 97.3 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 23 | SSL-KD (R21D-18) | 97.3 | No | A Large-Scale Analysis on Self-Supervised Video ... | 2023-06-09 | - |
| 24 | Multi-stream I3D | 97.2 | No | - | - | - |
| 25 | CA2ST(B/16) | 97.2 | No | CA^2ST: Cross-Attention in Audio, Space, and Tim... | 2025-03-30 | - |
| 26 | Hidden Two-Stream | 97.1 | No | Hidden Two-Stream Convolutional Networks for Act... | 2017-04-02 | Code |
| 27 | D3D (Kinetics-600 pretraining) | 97.1 | Yes | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 28 | AMD(ViT-B/16) | 97.1 | Yes | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 29 | D3D (Kinetics-400 pretraining) | 97 | Yes | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 30 | LGD-3D RGB | 97 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 31 | STAM-32 (ImageNet/Kinetics pretraining) | 97 | Yes | An Image is Worth 16x16 Words, What is a Video W... | 2021-03-25 | Code |
| 32 | FASTER32 | 96.9 | No | FASTER Recurrent Networks for Efficient Video Cl... | 2019-06-10 | - |
| 33 | R[2+1]D-RGB (Kinetics pretrained) | 96.8 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 34 | S3D-G (ImageNet, Kinetics-400 pretrained) | 96.8 | Yes | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 35 | LGD-3D Flow | 96.8 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 36 | Flow-I3D (ImageNet+Kinetics pre-training) | 96.7 | Yes | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 37 | VidTr-L | 96.7 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 38 | CMA iter1-S | 96.5 | No | Two-Stream Video Classification with Cross-Modal... | 2019-08-01 | - |
| 39 | Flow-I3D (Kinetics pre-training) | 96.5 | Yes | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 40 | I3D RGB + DMC-Net (I3D) | 96.5 | No | DMC-Net: Generating Discriminative Motion Cues f... | 2019-01-11 | - |
| 41 | M3Video | 96.5 | No | Masked Motion Encoding for Self-Supervised Video... | 2022-10-12 | Code |
| 42 | A2-Net (ResNet-50) | 96.4 | No | $A^2$-Nets: Double Attention Networks | 2018-10-27 | - |
| 43 | pBYOL | 96.3 | No | A Large-Scale Study on Unsupervised Spatiotempor... | 2021-04-29 | Code |
| 44 | STM (ImageNet+Kinetics pretrain) | 96.2 | No | STM: SpatioTemporal and Motion Encoding for Acti... | 2019-08-07 | - |
| 45 | VideoMAE | 96.1 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 46 | MF-Net, RGB only (ImageNet+Kinetics pretrained) | 96 | Yes | Multi-Fiber Networks for Video Recognition | 2018-07-30 | - |
| 47 | Optical Flow Guided Feature | 96 | No | Optical Flow Guided Feature: A Fast and Robust M... | 2017-11-29 | Code |
| 48 | MARS+RGB+Flow (16 frames) | 95.8 | No | - | - | Code |
| 49 | Prob-Distill | 95.7 | No | Attention Distillation for Learning Video Repres... | 2019-04-05 | - |
| 50 | RGB-I3D (ImageNet+Kinetics pre-training) | 95.6 | Yes | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 51 | R[2+1]D-Flow (Kinetics pretrained) | 95.5 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 52 | TVNet+IDT | 95.4 | No | End-to-End Learning of Motion Representation for... | 2018-04-02 | Code |
| 53 | SCE (R3D-50) | 95.3 | No | Similarity Contrastive Estimation for Image and ... | 2022-12-21 | Code |
| 54 | TesNet (ImageNet pretrained) | 95.2 | Yes | Learning spatio-temporal representations with te... | 2020-02-11 | - |
| 55 | MMV TSM-50x2 | 95.2 | No | Self-Supervised MultiModal Versatile Networks | 2020-06-29 | Code |
| 56 | I3D-LSTM | 95.1 | No | - | - | Code |
| 57 | RGB-I3D (Kinetics pre-training) | 95.1 | Yes | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 58 | R[2+1]D-TwoStream (Sports-1M pretrained) | 95 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 59 | X3D MobileNet-V3 LGD-GC | 94.85 | Yes | LIGAR: Lightweight General-purpose Action Recogn... | 2021-08-30 | Code |
| 60 | ST-ResNet + IDT | 94.6 | No | Spatiotemporal Residual Networks for Video Actio... | 2016-11-07 | Code |
| 61 | ResNeXt-101 (64f) | 94.5 | No | Can Spatiotemporal 3D CNNs Retrace the History o... | 2017-11-27 | Code |
| 62 | R-STAN-101 | 94.5 | No | - | - | - |
| 63 | TSN+TSM | 94.3 | No | Temporal-Spatial Mapping for Action Recognition | 2018-09-11 | - |
| 64 | ARTNet w/ TSN | 94.3 | No | Appearance-and-Relation Networks for Video Class... | 2017-11-24 | Code |
| 65 | Temporal Segment Networks | 94.2 | No | Temporal Segment Networks: Towards Good Practice... | 2016-08-02 | Code |
| 66 | TS-LSTM | 94.1 | No | TS-LSTM and Temporal-Inception: Exploiting Spati... | 2017-03-30 | Code |
| 67 | XKD (ViT-B/112/16) | 94.1 | No | XKD: Cross-modal Knowledge Distillation with Dom... | 2022-11-25 | Code |
| 68 | CVRL (R3D-152 2x; K600) | 93.9 | No | Spatiotemporal Contrastive Video Representation ... | 2020-08-09 | Code |
| 69 | SVT | 93.7 | No | Self-supervised Video Transformer | 2021-12-02 | Code |
| 70 | RSPNet | 93.7 | No | RSPNet: Relative Speed Perception for Unsupervis... | 2020-10-27 | Code |
| 71 | R[2+1]D-RGB (Sports-1M pretrained) | 93.6 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 72 | Two-stream I3D | 93.4 | No | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 73 | CVRL (R3D-50; K600) | 93.4 | No | Spatiotemporal Contrastive Video Representation ... | 2020-08-09 | Code |
| 74 | VideoMS (ViT-B) | 93.4 | No | EVEREST: Efficient Masked Video Autoencoder by R... | 2022-11-19 | Code |
| 75 | XKD-Modality-Agnostic (ViT-B/112/16) | 93.4 | No | XKD: Cross-modal Knowledge Distillation with Dom... | 2022-11-25 | Code |
| 76 | R[2+1]D-Flow (Sports-1M pretrained) | 93.3 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 77 | BraVe:V-FA (TSM-50x2) | 93.1 | No | Broaden Your Views for Self-Supervised Video Lea... | 2021-03-30 | Code |
| 78 | VIMPAC | 92.7 | No | VIMPAC: Video Pre-Training via Masked Token Pred... | 2021-06-21 | Code |
| 79 | S:VGG-16, T:VGG-16 (ImageNet pretrain) | 92.5 | Yes | Convolutional Two-Stream Network Fusion for Vide... | 2016-04-22 | Code |
| 80 | CrissCross (AudioSet) | 92.4 | No | Self-Supervised Audio-Visual Representation Lear... | 2021-11-09 | Code |
| 81 | DMC-Net (I3D) | 92.3 | No | DMC-Net: Generating Discriminative Motion Cues f... | 2019-01-11 | - |
| 82 | CVRL (R3D-50; K400) | 92.2 | No | Spatiotemporal Contrastive Video Representation ... | 2020-08-09 | Code |
| 83 | two-in-one two stream | 92 | No | Dance with Flow: Two-in-One Stream Action Detect... | 2019-04-01 | Code |
| 84 | LTC | 91.7 | No | Long-term Temporal Convolutions for Action Recog... | 2016-04-15 | Code |
| 85 | R-STAN-50 | 91.5 | No | - | - | - |
| 86 | TDD + IDT | 91.5 | No | Action Recognition with Trajectory-Pooled Deep-C... | 2015-05-19 | Code |
| 87 | AVID+CMA (Modified R2+1D-18 on AudioSet) | 91.5 | No | Audio-Visual Instance Discrimination with Cross-... | 2020-04-27 | Code |
| 88 | CrissCross (Kinetics400) | 91.5 | No | Self-Supervised Audio-Visual Representation Lear... | 2021-11-09 | Code |
| 89 | Very deep two-stream ConvNet | 91.4 | No | Towards Good Practices for Very Deep Two-Stream ... | 2015-07-08 | Code |
| 90 | VideoMAE(no extra data) | 91.3 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 91 | 3D ResNeXt-101 + Confidence Distillation | 91.2 | No | Efficient Action Recognition Using Confidence Di... | 2021-09-05 | - |
| 92 | MR Two-Stream R-CNN | 91.1 | No | - | - | - |
| 93 | AVID (Modified R2+1D-18 on AudioSet) | 91 | No | Audio-Visual Instance Discrimination with Cross-... | 2020-04-27 | Code |
| 94 | ViCC (S3D; R+F) | 90.5 | No | Self-supervised Video Representation Learning wi... | 2021-06-18 | Code |
| 95 | Dynamic Image Networks + IDT | 89.1 | No | - | - | Code |
| 96 | ViCC (S3D; RGB) | 88.8 | No | Self-supervised Video Representation Learning wi... | 2021-06-18 | Code |
| 97 | ViCC (R2+1D; R+F) | 88.8 | No | Self-supervised Video Representation Learning wi... | 2021-06-18 | Code |
| 98 | Two-stream+LSTM | 88.6 | No | Beyond Short Snippets: Deep Networks for Video C... | 2015-03-31 | Code |
| 99 | P3D (ImageNet + Sports1M) | 88.6 | Yes | Learning Spatio-Temporal Representation with Pse... | 2017-11-28 | Code |
| 100 | CrissCross (Kinetics-Sound) | 88.3 | No | Self-Supervised Audio-Visual Representation Lear... | 2021-11-09 | Code |
| 101 | Two-Stream (ImageNet pretrained) | 88 | Yes | Two-Stream Convolutional Networks for Action Rec... | 2014-06-09 | Code |
| 102 | AVID+CMA (Modified R2+1D-18 on Kinetics) | 87.5 | No | Audio-Visual Instance Discrimination with Cross-... | 2020-04-27 | Code |
| 103 | AVID (Modified R2+1D-18 on Kinetics) | 86.9 | No | Audio-Visual Instance Discrimination with Cross-... | 2020-04-27 | Code |
| 104 | MV-CNN | 86.4 | No | Real-time Action Recognition with Enhanced Motio... | 2016-04-26 | Code |
| 105 | Dynamics 2 for DenseNet-201 Transformer | 86.1 | No | Video Action Recognition Collaborative Learning ... | 2023-02-17 | Code |
| 106 | R(2+1)D-18 (DistInit pretraining) | 85.8 | No | DistInit: Learning Video Representations Without... | 2019-01-26 | - |
| 107 | Res3D | 85.8 | No | ConvNet Architecture Search for Spatiotemporal F... | 2017-08-16 | Code |
| 108 | MCN (R3D-18; RGB) | 85.4 | No | Self-Supervised Video Representation Learning wi... | 2021-08-19 | - |
| 109 | MCN (R2+1D; RGB) | 84.8 | No | Self-Supervised Video Representation Learning wi... | 2021-08-19 | - |
| 110 | ActionFlowNet | 83.9 | No | ActionFlowNet: Learning Motion Representation fo... | 2016-12-09 | - |
| 111 | ViCC (R2+1D; RGB) | 82.8 | No | Self-supervised Video Representation Learning wi... | 2021-06-18 | Code |
| 112 | TCLR (R3D-18) | 82.4 | No | TCLR: Temporal Contrastive Learning for Video Re... | 2021-01-20 | Code |
| 113 | C3D | 82.3 | No | Learning Spatiotemporal Features with 3D Convolu... | 2014-12-02 | Code |
| 114 | PCL (ResNet-18) | 82.3 | No | Pretext-Contrastive Learning: Toward Good Practi... | 2020-10-29 | Code |
| 115 | HalluciNet (ResNet-50) | 79.83 | No | HalluciNet-ing Spatiotemporal Representations Us... | 2019-12-10 | Code |
| 116 | R[2+1]D (VideoMoCo) | 78.7 | No | VideoMoCo: Contrastive Video Representation Lear... | 2021-03-10 | Code |
| 117 | DPC (Modified 3D Resnet-34) | 75.7 | No | Video Representation Learning by Dense Predictiv... | 2019-09-10 | Code |
| 118 | 3D-SqueezeNet | 74.94 | No | Resource Efficient 3D Convolutional Neural Netwo... | 2019-04-04 | Code |
| 119 | CoCLR | 74.5 | No | Self-supervised Co-training for Video Representa... | 2020-10-19 | Code |
| 120 | IIC (R3D) | 74.4 | No | Self-supervised Video Representation Learning Us... | 2020-08-06 | Code |
| 121 | 3D-ResNet-18 (VideoMoCo) | 74.1 | No | VideoMoCo: Contrastive Video Representation Lear... | 2021-03-10 | Code |
| 122 | ViCC (S3D; RGB) | 72.2 | No | Self-supervised Video Representation Learning wi... | 2021-06-18 | Code |
| 123 | TCE (ResNet-50) | 71.2 | No | Temporally Coherent Embeddings for Self-Supervis... | 2020-03-21 | Code |
| 124 | TCE (ResNet-18, Split 1) | 68.8 | No | Temporally Coherent Embeddings for Self-Supervis... | 2020-03-21 | Code |
| 125 | DPC (3D ResNet-18) | 68.2 | No | Video Representation Learning by Dense Predictiv... | 2019-09-10 | Code |
| 126 | TCE (ResNet-18, Split 1) | 68.2 | No | Temporally Coherent Embeddings for Self-Supervis... | 2020-03-21 | Code |
| 127 | VCP (R3D) | 66 | No | Video Cloze Procedure for Self-Supervised Spatio... | 2020-01-02 | Code |
| 128 | 3D Cubic Puzzles (3D ResNet-18) | 65.8 | No | Self-Supervised Video Representation Learning wi... | 2018-11-24 | - |
| 129 | Slow Fusion + Finetune top 3 layers | 65.4 | Yes | - | - | Code |
| 130 | Video Clip Ordering (R3D) | 64.9 | No | - | - | - |
| 131 | Skip-Clip (3D ResNet-18) | 64.4 | No | Skip-Clip: Self-Supervised Spatiotemporal Repres... | 2019-10-28 | - |
| 132 | MLGCN | 63.27 | No | - | - | - |
| 133 | 3D RotNet (3D ResNet-18) | 62.9 | No | Self-Supervised Spatiotemporal Feature Learning ... | 2018-11-28 | - |
| 134 | DPC (3D ResNet-18, Split 1) | 60.6 | No | Video Representation Learning by Dense Predictiv... | 2019-09-10 | Code |
| 135 | O3N (AlexNet) | 60.3 | No | Self-Supervised Video Representation Learning Wi... | 2016-11-21 | - |
| 136 | Contrastive Multiview Coding (CaffeNet x2) | 59.1 | No | Contrastive Multiview Coding | 2019-06-13 | Code |
| 137 | Motion & Appearance (C3D) | 58.8 | No | Self-supervised Spatio-temporal Representation L... | 2019-04-07 | Code |
| 138 | 3D-ShuffleNetV2 0.25x | 56.52 | No | Resource Efficient 3D Convolutional Neural Netwo... | 2019-04-04 | Code |
| 139 | 3D-MobileNetV2 0.2x | 55.56 | No | Resource Efficient 3D Convolutional Neural Netwo... | 2019-04-04 | Code |
| 140 | Arrow of Time (AlexNet) | 55.3 | No | - | - | - |
| 141 | VideoGan (C3D) | 52.1 | No | Generating Videos with Scene Dynamics | 2016-09-08 | - |
| 142 | Shuffle and Learn (AlexNet) | 50.9 | No | Shuffle and Learn: Unsupervised Learning using T... | 2016-03-28 | - |
| 143 | Baseline UCF101 | 43.9 | No | UCF101: A Dataset of 101 Human Actions Classes F... | 2012-12-03 | Code |
| 144 | CD-UAR | 42.5 | No | Towards Universal Representation for Unseen Acti... | 2018-03-22 | - |
| 145 | SL | 35.2 | No | - | - | - |
| 146 | I3D + PoTion | 29.3 | No | - | - | - |