| 1 | InternVideo | 70 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 2 | VideoMAE V2-g | 68.7 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 3 | Side4Video (EVA ViT-E/14) | 67.3 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 4 | ATM | 65.6 | No | What Can Simple Arithmetic Operations Do for Tem... | 2023-07-18 | Code |
| 5 | TAdaFormer-L/14 | 63.7 | Yes | Temporally-Adaptive Models for Efficient Video U... | 2023-08-10 | Code |
| 6 | TDS-CLIP-ViT-L/14(8frames) | 63 | No | TDS-CLIP: Temporal Difference Side Network for I... | 2024-08-20 | Code |
| 7 | UniFormerV2-L | 62.7 | Yes | - | - | Code |
| 8 | StructVit-B-4-1 | 61.3 | No | Learning Correlation Structures for Vision Trans... | 2024-04-05 | - |
| 9 | UniFormer-B (IN-1K + Kinetics400) | 60.9 | No | - | - | Code |
| 10 | TAdaConvNeXtV2-B | 60.7 | Yes | Temporally-Adaptive Models for Efficient Video U... | 2023-08-10 | Code |
| 11 | TPS | 58.3 | No | Spatiotemporal Self-attention Modeling with Temp... | 2022-07-27 | Code |
| 12 | MSMA (8+16frames) | 57.9 | No | - | - | - |
| 13 | UniFormer-B (IN-1K + Kinetics600) | 57.6 | No | - | - | Code |
| 14 | SIFA | 57.3 | No | Stand-Alone Inter-Frame Attention in Video Models | 2022-06-14 | Code |
| 15 | EAN ResNet50 (single clip, center crop, 8+16 ensemble, with sparse Transformer) | 57.2 | No | EAN: Event Adaptive Network for Enhanced Action ... | 2021-07-22 | Code |
| 16 | TCM (Ensemble) | 57.2 | No | Motion-driven Visual Tempo Learning for Video-ba... | 2022-02-24 | Code |
| 17 | BQNEn (ImageNet + K400 pretrained) | 57.1 | No | Busy-Quiet Video Disentangling for Video Classif... | 2021-03-29 | Code |
| 18 | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 56.8 | No | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code |
| 19 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 56.6 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 20 | CT-Net Ensemble (R50, 8+12+16+24) | 56.6 | No | CT-Net: Channel Tensorization Network for Video ... | 2021-06-03 | Code |
| 21 | MoDS (8+16frames) | 56.6 | No | - | - | - |
| 22 | MLP-3D | 56.5 | No | MLP-3D: A MLP-like 3D Architecture with Grouped ... | 2022-06-13 | - |
| 23 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 56.1 | No | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 24 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 55.8 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 25 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 55.5 | No | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 26 | PAN ResNet101 (RGB only, no Flow) | 55.3 | No | PAN: Towards Fast Action Recognition via Learnin... | 2020-08-08 | Code |
| 27 | GSM Ensemble InceptionV3 (ImageNet pretrained) | 55.16 | Yes | Gate-Shift Networks for Video Action Recognition | 2019-12-01 | Code |
| 28 | MSNet-R50En (ensemble) | 55.1 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 29 | AE-Net (8+16frames) | 55 | No | - | - | - |
| 30 | VoV3D-L (32frames, Kinetics pretrained, single) | 54.59 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 31 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 54.4 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 32 | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 54.3 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 33 | RNL+TSM Ensemble(R50+R101, ImageNet pretrained) | 54.1 | No | Region-based Non-local Operation for Video Class... | 2020-07-17 | Code |
| 34 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 54 | No | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 35 | MVFNet-R50EN | 54 | No | MVFNet: Multi-View Fusion Network for Efficient ... | 2020-12-13 | Code |
| 36 | STPG (8+16frames) | 53.5 | No | - | - | - |
| 37 | GB + DF + LB (ResNet152, ImageNet pretrained) | 53.4 | Yes | Action recognition with spatial-temporal discrim... | 2019-08-20 | - |
| 38 | ip-CSN-152 (IG-65M pretraining) | 53.3 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 39 | MARS+RGB+Flow (64 frames, Kinetics pretrained) | 53 | Yes | - | - | Code |
| 40 | RNL+TSM Ensemble(ResNet50, ImageNet pretrained) | 52.7 | No | Region-based Non-local Operation for Video Class... | 2020-07-17 | Code |
| 41 | VoV3D-M (32frames, Kinetics pretrained, single) | 52.68 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 42 | TSM+W3 (16 frames, ResNet50) | 52.6 | No | Knowing What, Where and When to Look: Efficient ... | 2020-04-02 | - |
| 43 | AK-Net | 52.5 | No | Action Keypoint Network for Efficient Video Reco... | 2022-01-17 | - |
| 44 | MSNet-R50 (16 frames, ImageNet pretrained) | 52.1 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 45 | ir-CSN-152 (IG-65M pretraining) | 52.1 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 46 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 51.9 | No | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 47 | GSM InceptionV3 (16 frames, ImageNet pretrained) | 51.68 | Yes | Gate-Shift Networks for Video Action Recognition | 2019-12-01 | Code |
| 48 | R(2+1)D-152 (IG-65M pretraining) | 51.6 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 49 | MSNet-R50 (8 frames, ImageNet pretrained) | 50.9 | No | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 50 | TSM (RGB + Flow) | 50.7 | No | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code |
| 51 | STM (16 frames, ImageNet pretraining) | 50.7 | No | STM: SpatioTemporal and Motion Encoding for Acti... | 2019-08-07 | - |
| 52 | VoV3D-L (32frames, from scratch, single) | 50.6 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 53 | ResNet50 I3D (Moments pretrained) | 50 | Yes | Moments in Time Dataset: one million videos for ... | 2018-01-09 | Code |
| 54 | VoV3D-M (32frames, from scratch, single) | 49.8 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 55 | TSMEn | 49.7 | No | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code |
| 56 | TRG (Inception-V3) | 49.7 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | - |
| 57 | TRG (ResNet-50) | 49.5 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | - |
| 58 | VoV3D-L (16frames, from scratch, single) | 49.5 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 59 | ir-CSN-152 | 49.3 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 60 | RSTG (Kinetics pretrained) | 49.2 | Yes | Recurrent Space-time Graph Neural Networks | 2019-04-11 | Code |
| 61 | ResNet50 I3D (Kinetics pretrained) | 48.6 | Yes | Moments in Time Dataset: one million videos for ... | 2018-01-09 | Code |
| 62 | ir-CSN-101 | 48.4 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 63 | S3D-G (ImageNet pretrained) | 48.2 | Yes | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 64 | VoV3D-M (16frames, from scratch, single) | 48.1 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 65 | S3D | 47.3 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 66 | TSM | 47.2 | No | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code |
| 67 | ECO-Net (ImageNet pretrained) | 46.4 | Yes | ECO: Efficient Convolutional Network for Online ... | 2018-04-24 | Code |
| 68 | ECO-Net | 46.4 | No | ECO: Efficient Convolutional Network for Online ... | 2018-04-24 | Code |
| 69 | NL I3D + GCN | 46.1 | No | Videos as Space-Time Region Graphs | 2018-06-05 | - |
| 70 | NL I3D | 44.4 | No | Non-local Neural Networks | 2017-11-21 | Code |
| 71 | Motion Feature Net | 43.9 | No | Motion Feature Network: Fixed Motion Filter for ... | 2018-07-26 | - |
| 72 | Motion Feature Net | 43.9 | No | Motion Feature Network: Fixed Motion Filter for ... | 2018-07-26 | - |
| 73 | 2-Stream TRN | 42.01 | No | Temporal Relational Reasoning in Videos | 2017-11-22 | Code |
| 74 | 2-Stream TRN | 42.01 | No | Temporal Relational Reasoning in Videos | 2017-11-22 | Code |
| 75 | HF-TSN (ImageNet pretraining) | 41.97 | Yes | Hierarchical Feature Aggregation Networks for Vi... | 2019-05-29 | - |
| 76 | MARS+RGB+Flow (16 frames, Kinetics pretrained) | 40.4 | No | - | - | Code |
| 77 | M-TRN | 34.4 | No | Temporal Relational Reasoning in Videos | 2017-11-22 | Code |