Metric: Top 5 Accuracy (higher is better)
| # | Model | Top 5 Accuracy (%) | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | VideoMAE V2-g | 91.9 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 2 | Side4Video (EVA ViT-E/14) | 88.8 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 3 | ATM | 88.6 | No | What Can Simple Arithmetic Operations Do for Tem... | 2023-07-18 | Code |
| 4 | UniFormerV2-L | 88 | Yes | - | - | Code |
| 5 | TDS-CLIP-ViT-L/14(8frames) | 87.8 | No | TDS-CLIP: Temporal Difference Side Network for I... | 2024-08-20 | Code |
| 6 | UniFormer-B (IN-1K + Kinetics400) | 87.3 | No | - | - | Code |
| 7 | TRG (ResNet-50) | 86.1 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | - |
| 8 | UniFormer-B (IN-1K + Kinetics600) | 84.9 | No | - | - | Code |
| 9 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 84.4 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 10 | BQNEn (ImageNet + K400 pretrained) | 84.2 | No | Busy-Quiet Video Disentangling for Video Classif... | 2021-03-29 | Code |
| 11 | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 84.1 | No | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code |
| 12 | EAN ResNet50 (single clip, center crop,8+16 ensemble, with sparse Transformer) | 83.9 | No | EAN: Event Adaptive Network for Enhanced Action ... | 2021-07-22 | Code |
| 13 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 83.9 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 14 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 83.8 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 15 | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 82.9 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code |
| 16 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 82.8 | No | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 17 | PAN ResNet101 (RGB only, no Flow) | 82.8 | No | PAN: Towards Fast Action Recognition via Learnin... | 2020-08-08 | Code |
| 18 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 82.6 | No | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 19 | VoV3D-L (32frames, Kinetics pretrained, single) | 82.3 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 20 | MSNet-R50 (16 frames, ImageNet pretrained) | 82.3 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 21 | RNL+TSM Ensemble(R50+R101, ImageNet pretrained) | 82.2 | No | Region-based Non-local Operation for Video Class... | 2020-07-17 | Code |
| 22 | RNL+TSM Ensemble(ResNet50, ImageNet pretrained) | 81.5 | No | Region-based Non-local Operation for Video Class... | 2020-07-17 | Code |
| 23 | TSM+W3 (16 frames, ResNet50) | 81.3 | No | Knowing What, Where and When to Look: Efficient ... | 2020-04-02 | - |
| 24 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 81.1 | No | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 25 | VoV3D-M (32frames, Kinetics pretrained, single) | 80.43 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 26 | MSNet-R50 (8 frames, ImageNet pretrained) | 80.3 | No | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 27 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 79.6 | No | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code |
| 28 | VoV3D-L (32frames, from scratch, single) | 78.7 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 29 | S3D-G (ImageNet pretrained) | 78.7 | Yes | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 30 | TSMEn | 78.5 | No | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code |
| 31 | S3D | 78.1 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 32 | VoV3D-M (32frames, from scratch, single) | 78 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 33 | VoV3D-L (16frames, from scratch, single) | 78 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
| 34 | TSM | 77.1 | No | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code |
| 35 | VoV3D-M (16frames, from scratch, single) | 76.9 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code |
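
For readers unfamiliar with the metric, top-5 accuracy counts a prediction as correct when the true class is among the five highest-scoring classes. The sketch below is illustrative only: the function name `top_k_accuracy`, its parameters, and the toy scores are assumptions made for this example, not the evaluation code used by any model in the table.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 5,
) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes.

    Hypothetical helper for illustration; `scores[i][c]` is the score the
    model assigns to class `c` for sample `i`.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy data: two samples, six classes (made up for this example).
toy_scores = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # true class 1 is ranked 5th: a top-5 hit
    [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],  # true class 5 is ranked 6th: a miss
]
toy_labels = [1, 5]
print(top_k_accuracy(toy_scores, toy_labels, k=5))  # 50.0
```

With `k=1` the same function gives plain top-1 accuracy, which is never higher than the top-5 figure for the same predictions.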