Metric: mAP (higher is better)
| # | Model | mAP | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | LART (Hiera-H, K700 PT+FT) | 45.1 | Yes | On the Benefits of 3D Pose and Tracking for Huma... | 2023-04-03 | Code |
| 2 | Hiera-H (K700 PT+FT) | 43.3 | Yes | Hiera: A Hierarchical Vision Transformer without... | 2023-06-01 | Code |
| 3 | VideoMAE V2-g | 42.6 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 4 | STAR/L | 41.7 | Yes | End-to-End Spatio-Temporal Action Localisation w... | 2023-04-24 | - |
| 5 | MVD (Kinetics400 pretrain+finetune, ViT-H, 16x4) | 41.1 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 6 | InternVideo | 41.01 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 7 | MVD (Kinetics400 pretrain, ViT-H, 16x4) | 40.1 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 8 | MaskFeat (Kinetics-600 pretrain, MViT-L) | 39.8 | Yes | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 9 | UMT-L (ViT-L/16) | 39.8 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 10 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | 39.5 | Yes | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 11 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | 39.3 | Yes | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 12 | MVD (Kinetics400 pretrain+finetune, ViT-L, 16x4) | 38.7 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 13 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | 37.8 | Yes | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 14 | MVD (Kinetics400 pretrain, ViT-L, 16x4) | 37.7 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 15 | VideoMAE (K400 pretrain, ViT-H, 16x4) | 36.5 | Yes | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 16 | VideoMAE (K700 pretrain, ViT-L, 16x4) | 36.1 | Yes | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 17 | MeMViT-24 | 35.4 | Yes | MeMViT: Memory-Augmented Multiscale Vision Trans... | 2022-01-20 | Code |
| 18 | MViTv2-L (IN21k, K700) | 34.4 | Yes | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 19 | VideoMAE (K400 pretrain, ViT-L, 16x4) | 34.3 | Yes | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 20 | MVD (Kinetics400 pretrain+finetune, ViT-B, 16x4) | 34.2 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 21 | AMD (ViT-B/16) | 33.5 | Yes | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 22 | HIT | 32.6 | No | Holistic Interaction Transformer Network for Act... | 2022-10-23 | Code |
| 23 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) | 31.8 | Yes | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 24 | ACAR-Net, SlowFast R-101 (Kinetics-700 pretraining) | 31.72 | Yes | Actor-Context-Actor Relation Network for Spatio-... | 2020-06-14 | Code |
| 25 | MVD (Kinetics400 pretrain, ViT-B, 16x4) | 31.1 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 26 | Object Transformer | 31 | No | Towards Long-Form Video Understanding | 2021-06-21 | Code |
| 27 | MViT-B-24, 32x3 (Kinetics-600 pretraining) | 28.7 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 28 | MViT-B, 32x3 (Kinetics-600 pretraining) | 27.5 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 29 | SlowFast, 16x8 R101+NL (Kinetics-600 pretraining) | 27.5 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 30 | MViT-B, 64x3 (Kinetics-400 pretraining) | 27.3 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 31 | SlowFast, 8x8 R101+NL (Kinetics-600 pretraining) | 27.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 32 | MViT-B, 32x3 (Kinetics-400 pretraining) | 26.8 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 33 | VideoMAE (K400 pretrain, ViT-B, 16x4) | 26.7 | Yes | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 34 | ORViT MViT-B, 16x4 (K400 pretraining) | 26.6 | No | Object-Region Video Transformers | 2021-10-13 | Code |
| 35 | MViT-B, 16x4 (Kinetics-600 pretraining) | 26.1 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 36 | MViT-B, 16x4 (Kinetics-400 pretraining) | 24.5 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 37 | SlowFast, 8x8, R101 (Kinetics-400 pretraining) | 23.8 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 38 | SlowFast, 4x16, R50 (Kinetics-400 pretraining) | 21.9 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
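
For readers unfamiliar with the metric, the sketch below shows one common way to compute mAP: the mean over classes of a non-interpolated average precision. It is a simplified illustration, not the benchmark's official evaluation script. The function names, the input layout, and the toy data are all hypothetical, and it assumes each detection has already been matched against ground truth (true positive or not). The leaderboard reports the value as a percentage.

```python
from typing import Dict, List, Tuple

# One detection: (confidence score, whether it matched a ground-truth instance).
Detection = Tuple[float, bool]


def average_precision(detections: List[Detection], num_positives: int) -> float:
    """Non-interpolated AP for a single class.

    Assumes matching to ground truth was done beforehand (hypothetical input format).
    """
    if num_positives == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            hits += 1
            # Precision at this rank, accumulated only at true positives.
            precision_sum += hits / rank
    return precision_sum / num_positives


def mean_average_precision(per_class: Dict[str, Tuple[List[Detection], int]]) -> float:
    """Mean of per-class AP; each value is (detections, ground-truth count)."""
    aps = [average_precision(dets, n_pos) for dets, n_pos in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Toy data, invented purely for illustration.
toy = {
    "stand": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "sit": ([(0.6, False), (0.5, True)], 1),
}
print(f"mAP = {100 * mean_average_precision(toy):.1f}")
```

On the toy data the per-class APs are 5/6 and 1/2, so the mean is 2/3. Real evaluations add details this sketch omits, such as the box-overlap threshold used to decide whether a detection counts as a true positive.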