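| Rank | Model | Score | Extra training data | Paper | Date | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | TokenLearner | 66.3 | No | TokenLearner: What Can 8 Learned Tokens Do for I... | 2021-06-21 | Code |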
| 2 | TubeViT-L | 66.2 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 3 | MoViNet-A6 | 63.2 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 4 | DEEP-HAL with ODF+SDF (AssembleNet++) | 62.29 | No | Self-supervising Action Recognition by Statistic... | 2020-01-14 | - |
| 5 | AssembleNet++ 50 | 59.8 | No | AssembleNet++: Assembling Modality Representatio... | 2020-08-18 | Code |
| 6 | AssembleNet | 58.6 | Yes | AssembleNet: Searching for Multi-Stream Neural C... | 2019-05-30 | Code |
| 7 | AssembleNet-101 | 58.6 | No | AssembleNet: Searching for Multi-Stream Neural C... | 2019-05-30 | Code |
| 8 | VicTR (ViT-L/14) | 57.6 | No | VicTR: Video-conditioned Text Representations fo... | 2023-04-05 | - |
| 9 | AssembleNet++ 50 without object | 54.98 | No | AssembleNet++: Assembling Modality Representatio... | 2020-08-18 | Code |
| 10 | BIKE | 50.7 | No | Bidirectional Cross-Modal Knowledge Exploration ... | 2022-12-31 | Code |
| 11 | DEEP-HAL with ODF+SDF (I3D) | 50.16 | No | Self-supervising Action Recognition by Statistic... | 2020-01-14 | - |
| 12 | MoViNet-A4 | 48.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 13 | AdaFocus (weak supervision, MViT-B-24, 32x3) | 47.8 | No | Towards Weakly Supervised End-to-end Learning fo... | 2023-11-28 | - |
| 14 | MViT-B-24, 32x3 (Kinetics-600 pretraining) | 47.7 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 15 | En-VidTr-L | 47.3 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 16 | MViT-B, 32x3 (Kinetics-600 pretraining) | 47.1 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 17 | MViT-B-24, 32x3 (Kinetics-400 pretraining) | 46.3 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 18 | SlowFast (Kinetics-600 pretraining, NL) | 45.2 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 19 | MViT-B, 32x3 (Kinetics-400 pretraining) | 44.3 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 20 | ActionCLIP (ViT-B/16) | 44.3 | No | ActionCLIP: A New Paradigm for Video Action Reco... | 2021-09-17 | Code |
| 21 | MViT-B, 16x4 (Kinetics-600 pretraining) | 43.9 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 22 | VidTr-L | 43.5 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 23 | JMRN + R101-NL-LFB | 43.23 | No | Pose And Joint-Aware Action Recognition | 2020-10-16 | Code |
| 24 | HAF+BoW/FV/OFF halluc. +MSK×8/PN | 43.1 | No | Hallucinating IDT Descriptors and I3D Optical Fl... | 2019-06-13 | - |
| 25 | LFB | 42.5 | Yes | Long-Term Feature Banks for Detailed Video Under... | 2018-12-12 | Code |
| 26 | SlowFast (Kinetics-400 pretraining, NL) | 42.5 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 27 | SlowFast (Kinetics-600 pretraining) | 42.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 28 | AdaFocus (weak supervision, MViT-B-K400-pretrain, 16x4) | 41.4 | No | Towards Weakly Supervised End-to-end Learning fo... | 2023-11-28 | - |
| 29 | AdaFocus (weak supervision, X3D-L, 32x3) | 41.2 | No | Towards Weakly Supervised End-to-end Learning fo... | 2023-11-28 | - |
| 30 | Timeception (R3D) | 41.1 | No | Timeception for Complex Action Recognition | 2018-12-04 | Code |
| 31 | PA3D + (GCN + I3D + NL I3D) | 41 | No | - | - | - |
| 32 | PoTion + (GCN + I3D + NL I3D) | 40.8 | No | - | - | - |
| 33 | MViT-B, 16x4 (Kinetics-400 pretraining) | 40 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 34 | STRG | 39.7 | Yes | Videos as Space-Time Region Graphs | 2018-06-05 | - |
| 35 | AdaFocus (weak supervision, Slowfast-R50, 16x8) | 39.3 | No | Towards Weakly Supervised End-to-end Learning fo... | 2023-11-28 | - |
| 36 | STLT + I3D | 38.5 | No | Revisiting spatio-temporal layouts for compositi... | 2021-11-02 | Code |
| 37 | EvaNet | 38.1 | Yes | Evolving Space-Time Neural Architectures for Vid... | 2018-11-26 | - |
| 38 | Timeception (I3D) | 37.2 | No | Timeception for Complex Action Recognition | 2018-12-04 | Code |
| 39 | I3D | 32.9 | No | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 40 | MoViNet-A2 | 32.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 41 | Timeception (R2D) | 31.6 | No | Timeception for Complex Action Recognition | 2018-12-04 | Code |
| 42 | MultiScale TRN | 25.2 | Yes | Temporal Relational Reasoning in Videos | 2017-11-22 | Code |
| 43 | Co Slow_64 | 25.2 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 44 | Slow-8×8 | 24.1 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 45 | Asyn-TF | 22.4 | Yes | Asynchronous Temporal Fields for Action Recognit... | 2016-12-19 | Code |
| 46 | CoViAR | 21.9 | Yes | Compressed Video Action Recognition | 2017-12-02 | Code |
| 47 | Co Slow_8 | 21.5 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 48 | 2-Strm | 18.6 | No | Two-Stream Convolutional Networks for Action Rec... | 2014-06-09 | Code |
| 49 | JMRN (Pose only) | 16.2 | No | Pose And Joint-Aware Action Recognition | 2020-10-16 | Code |