| 1 | VideoMAE V2-g | 88.7 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 2 | DejaVid | 88.6 | Yes | - | - | Code |
| 3 | DEEP-HAL with ODF+SDF(I3D) | 87.56 | Yes | Self-supervising Action Recognition by Statistic... | 2020-01-14 | - |
| 4 | TO+MaxExp+IDT | 87.21 | Yes | High-order Tensor Pooling with Attention for Act... | 2021-10-11 | - |
| 5 | SCK⊕(I3D)+IDT | 86.11 | Yes | Tensor Representations for Action Recognition | 2020-12-28 | Code |
| 6 | SO+MaxExp+IDT | 85.7 | Yes | High-order Tensor Pooling with Attention for Act... | 2021-10-11 | - |
| 7 | R2+1D-BERT | 85.1 | Yes | Late Temporal Modeling in 3D CNN Architectures w... | 2020-08-03 | Code |
| 8 | Ours + ResNext101 BERT | 84.53 | No | Pose And Joint-Aware Action Recognition | 2020-10-16 | Code |
| 9 | SMART | 84.36 | No | SMART Frame Selection for Action Recognition | 2020-12-19 | - |
| 10 | OmniSource (SlowOnly-8x8-R101-RGB + I3D Flow) | 83.8 | Yes | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 11 | ZeroI2V ViT-L/14 | 83.4 | Yes | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra... | 2023-10-02 | Code |
| 12 | PERF-Net (distilled S3D-G) | 83.2 | No | PERF-Net: Pose Empowered RGB-Flow Net | 2020-09-28 | - |
| 13 | BIKE | 83.1 | Yes | Bidirectional Cross-Modal Knowledge Exploration ... | 2022-12-31 | Code |
| 14 | BubbleNET | 82.6 | Yes | - | - | - |
| 15 | HAF+BoW/FV halluc | 82.48 | Yes | Hallucinating IDT Descriptors and I3D Optical Fl... | 2019-06-13 | - |
| 16 | CCS + TSN (ImageNet+Kinetics pretrained) | 81.9 | Yes | Cooperative Cross-Stream Network for Discriminat... | 2019-08-27 | - |
| 17 | RepFlow-50 ([2+1]D CNN, FcF, Non-local block) | 81.1 | No | Representation Flow for Action Recognition | 2018-10-02 | Code |
| 18 | Multi-stream I3D | 80.92 | No | - | - | - |
| 19 | MARS+RGB+Flow (64 frames, Kinetics pretrained) | 80.9 | Yes | - | - | Code |
| 20 | Two-stream I3D | 80.9 | Yes | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 21 | Two-Stream I3D (ImageNet+Kinetics pre-training) | 80.7 | Yes | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 22 | LGD-3D Two-stream | 80.5 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 23 | D3D + D3D | 80.5 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 24 | AMD(ViT-B/16) | 79.6 | Yes | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 25 | D3D (Kinetics-600 pretraining) | 79.3 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 26 | LGD-3D Flow | 78.9 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 27 | Hidden Two-Stream | 78.7 | No | Hidden Two-Stream Convolutional Networks for Act... | 2017-04-02 | Code |
| 28 | R[2+1]D-TwoStream (Kinetics pretrained) | 78.7 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 29 | D3D (Kinetics-400 pretraining) | 78.7 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 30 | I3D RGB + DMC-Net (I3D) | 77.8 | No | DMC-Net: Generating Discriminative Motion Cues f... | 2019-01-11 | - |
| 31 | BQN | 77.6 | No | Busy-Quiet Video Disentangling for Video Classif... | 2021-03-29 | Code |
| 32 | MSNet-R50 (16 frames, ImageNet pretrained) | 77.4 | No | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 33 | Flow-I3D (Kinetics pre-training) | 77.3 | Yes | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 34 | Flow-I3D (ImageNet+Kinetics pre-training) | 77.1 | Yes | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 35 | HATNet (32 frames) | 76.5 | No | Large Scale Holistic Video Understanding | 2019-04-25 | Code |
| 36 | R[2+1]D-Flow (Kinetics pretrained) | 76.4 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 37 | S3D-G (ImageNet, Kinetics-400 pretrained) | 75.9 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 38 | FASTER32 (Kinetics pretrain) | 75.7 | Yes | FASTER Recurrent Networks for Efficient Video Cl... | 2019-06-10 | - |
| 39 | LGD-3D RGB | 75.7 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 40 | RGB-I3D (ImageNet+Kinetics pre-training) | 74.8 | Yes | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 41 | R[2+1]D-RGB (Kinetics pretrained) | 74.5 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 42 | VidTr-L | 74.4 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 43 | ADL+ResNet+IDT | 74.3 | No | Contrastive Video Representation Learning via Ad... | 2018-07-24 | - |
| 44 | RGB-I3D (Kinetics pre-training) | 74.3 | Yes | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 45 | Optical Flow Guided Feature | 74.2 | No | Optical Flow Guided Feature: A Fast and Robust M... | 2017-11-29 | Code |
| 46 | R[2+1]D-TwoStream (Sports1M pretrained) | 72.7 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 47 | TVNet+IDT | 72.6 | No | End-to-End Learning of Motion Representation for... | 2018-04-02 | Code |
| 48 | STM Network+IDT | 72.2 | No | - | - | Code |
| 49 | STM (ImageNet+Kinetics pretrain) | 72.2 | No | STM: SpatioTemporal and Motion Encoding for Acti... | 2019-08-07 | - |
| 50 | Prob-Distill | 72 | No | Attention Distillation for Learning Video Repres... | 2019-04-05 | - |
| 51 | DMC-Net (I3D) | 71.8 | No | DMC-Net: Generating Discriminative Motion Cues f... | 2019-01-11 | - |
| 52 | TesNet (ImageNet pretrained) | 71.5 | No | Learning spatio-temporal representations with te... | 2020-02-11 | - |
| 53 | HF-ECOLite (ImageNet+Kinetics pretrain) | 71.13 | Yes | Hierarchical Feature Aggregation Networks for Vi... | 2019-05-29 | - |
| 54 | ARTNet w/ TSN | 70.9 | No | Appearance-and-Relation Networks for Video Class... | 2017-11-24 | Code |
| 55 | ST-ResNet + IDT | 70.3 | No | Spatiotemporal Residual Networks for Video Actio... | 2016-11-07 | Code |
| 56 | R[2+1]D-Flow (Sports1M pretrained) | 70.1 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 57 | Temporal Segment Networks | 69.4 | No | Temporal Segment Networks: Towards Good Practice... | 2016-08-02 | Code |
| 58 | TS-LSTM | 69 | No | TS-LSTM and Temporal-Inception: Exploiting Spati... | 2017-03-30 | Code |
| 59 | SVT | 67.2 | No | Self-supervised Video Transformer | 2021-12-02 | Code |
| 60 | R[2+1]D-RGB (Sports1M pretrained) | 66.6 | Yes | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 61 | TDD + IDT | 65.9 | No | Action Recognition with Trajectory-Pooled Deep-C... | 2015-05-19 | Code |
| 62 | VIMPAC | 65.9 | No | VIMPAC: Video Pre-Training via Masked Token Pred... | 2021-06-21 | Code |
| 63 | S:VGG-16, T:VGG-16 (ImageNet pretrained) | 65.4 | Yes | Convolutional Two-Stream Network Fusion for Vide... | 2016-04-22 | Code |
| 64 | Dynamic Image Networks + IDT | 65.2 | No | - | - | Code |
| 65 | LTC | 64.8 | No | Long-term Temporal Convolutions for Action Recog... | 2016-04-15 | Code |
| 66 | R-STAN-50 | 62.8 | No | - | - | - |
| 67 | DMC-Net (ResNet-18) | 62.8 | No | DMC-Net: Generating Discriminative Motion Cues f... | 2019-01-11 | - |
| 68 | SUSiNet (multi, Kinetics pretrained) | 62.7 | Yes | SUSiNet: See, Understand and Summarize it | 2018-12-03 | - |
| 69 | Two-Stream (ImageNet pretrained) | 59.4 | Yes | Two-Stream Convolutional Networks for Action Rec... | 2014-06-09 | Code |
| 70 | ActionFlowNet | 56.4 | No | ActionFlowNet: Learning Motion Representation fo... | 2016-12-09 | - |
| 71 | R-STAN-152 | 55.16 | No | - | - | - |
| 72 | Res3D | 54.9 | No | ConvNet Architecture Search for Spatiotemporal F... | 2017-08-16 | Code |
| 73 | R(2+1)D-18 (DistInit pretraining) | 54.8 | Yes | DistInit: Learning Video Representations Without... | 2019-01-26 | - |
| 74 | JRMN | 54.2 | No | Pose And Joint-Aware Action Recognition | 2020-10-16 | Code |
| 75 | CD-UAR | 51.8 | No | Towards Universal Representation for Unseen Acti... | 2018-03-22 | - |
| 76 | C3D | 51.6 | No | Learning Spatiotemporal Features with 3D Convolu... | 2014-12-02 | Code |
| 77 | R[2+1]D (VideoMoCo) | 49.2 | No | VideoMoCo: Contrastive Video Representation Lear... | 2021-03-10 | Code |
| 78 | 3D-ResNet-18 (VideoMoCo) | 43.6 | No | VideoMoCo: Contrastive Video Representation Lear... | 2021-03-10 | Code |