Metric: Parameters, in millions (higher is better)
| # | Model | Parameters | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | InternVideo2-6B | 2131 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 2 | VideoMAE V2-g | 1013 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 3 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | 633 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 4 | MAR (50% mask, ViT-L, 16x4) | 311 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 5 | MAR (75% mask, ViT-L, 16x4) | 311 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 6 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | 305 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 7 | VideoMAE (no extra data, ViT-L, 32x2) | 305 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 8 | VideoMAE (no extra data, ViT-L, 16frame) | 305 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 9 | MaskFeat (Kinetics600 pretrain, MViT-L) | 218 | Yes | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 10 | MViTv2-L (IN-21K + Kinetics400 pretrain) | 213.1 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 11 | MAR (50% mask, ViT-B, 16x4) | 94 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 12 | MAR (75% mask, ViT-B, 16x4) | 94 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 13 | BEVT (IN-1K + Kinetics400 pretrain) | 89 | Yes | BEVT: BERT Pretraining of Video Transformers | 2021-12-02 | Code |
| 14 | Swin-B (IN-21K + Kinetics400 pretrain) | 89 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 15 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | 87 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 16 | AMD (ViT-B/16) | 87 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 17 | VideoMAE (no extra data, ViT-B, 16frame) | 87 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 18 | CT-Net Ensemble (R50, 8+12+16+24) | 83.8 | Yes | CT-Net: Channel Tensorization Network for Video ... | 2021-06-03 | Code |
| 19 | MorphMLP-B (IN-1K) | 68.5 | Yes | MorphMLP: An Efficient MLP-Like Backbone for Spa... | 2021-11-24 | Code |
| 20 | MViTv2-B (IN-21K + Kinetics400 pretrain) | 51.1 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 21 | UniFormer-B (IN-1K + Kinetics400 pretrain) | 50.1 | Yes | - | - | Code |
| 22 | MViT-B, 32x3 (Kinetics600 pretrain) | 36.6 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code |
| 23 | GC-TDN Ensemble (R50, 8+16) | 27.4 | Yes | Group Contextualization for Video Recognition | 2022-03-18 | Code |
| 24 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | 22 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 25 | AMD (ViT-S/16) | 22 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 26 | UniFormer-S (IN-1K + Kinetics600 pretrain) | 21.4 | Yes | - | - | Code |