| 1 | TubeVit-H | 98.9 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 2 | UMT-L (ViT-L/16) | 98.8 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 3 | TubeVit-L | 98.7 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 4 | MTV-H (WTS 60M) | 98.5 | Yes | Multiview Transformers for Video Recognition | 2022-01-12 | Code |
| 5 | UniFormerV2-L | 98.5 | Yes | - | - | Code |
| 6 | VideoMAE V2-g (64x266x266) | 98.5 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 7 | mPLUG-2 | 98.3 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 8 | VideoMAE V2-g | 98.2 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 9 | MaskFeat (no extra data, MViT-L) | 98.0 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 10 | MViTv2-L (ImageNet-21k pretrain) | 97.9 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 11 | Florence (curated FLD-900M pretrain) | 97.9 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 12 | CoVeR (JFT-3B) | 97.8 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 13 | X-CLIP (ViT-L/14, CLIP) | 97.7 | Yes | Expanding Language-Image Pretrained Models for G... | 2022-08-04 | Code |
| 14 | TubeVit-B | 97.3 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 15 | CoVeR (JFT-300M) | 97.3 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 16 | Swin-L (384x384, ImageNet-21k pretrain) | 97.3 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 17 | MViTv2-B (train from scratch) | 97.2 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 18 | 🍷MerlotReserve-Large (+Audio) | 97.1 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 19 | TokenLearner 16at18 w. Fuser (L/10) | 97.0 | Yes | TokenLearner: What Can 8 Learned Tokens Do for I... | 2021-06-21 | Code |
| 20 | UniFormer-B (ImageNet-1K) | 96.7 | Yes | - | - | Code |
| 21 | 🍷MerlotReserve-Base (+Audio) | 96.6 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 22 | VATT-Large | 96.6 | Yes | VATT: Transformers for Multimodal Self-Supervise... | 2021-04-22 | Code |
| 23 | ViViT-H/16x2 (JFT) | 96.5 | Yes | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 24 | Swin-B (ImageNet-21k pretrain) | 96.5 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 25 | MoViNet-A6 | 96.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 26 | MoViNet-A5 (AutoAugment) | 96.4 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 27 | 🍷MerlotReserve-Large (no Audio) | 96.3 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 28 | XViT (x16) | 96.3 | No | Space-time Mixing Attention for Video Transformer | 2021-06-10 | Code |
| 29 | MViT-B-24, 32x3 | 96.3 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 30 | MViT-B, 32x3 | 96.3 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 31 | LGD-3D Two-stream | 96.2 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 32 | 🍷MerlotReserve-Base (no Audio) | 95.8 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 33 | ViViT-L/16x2 (320x320) | 95.7 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 34 | MoViNet-A5 | 95.7 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 35 | MViT-B, 16x4 | 95.7 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 36 | PERF-Net (distilled ResNet50-G) | 95.7 | No | PERF-Net: Pose Empowered RGB-Flow Net | 2020-09-28 | - |
| 37 | ViViT-L/16x2 | 95.6 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 38 | LGD-3D RGB | 95.6 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 39 | SlowFast 16x8 (ResNet-101 + NL) | 95.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 40 | SlowFast 16x8 (ResNet-101) | 95.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 41 | MoViNet-A4 | 94.9 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 42 | SlowFast 8x8 (ResNet-101) | 94.8 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 43 | SlowFast 8x8 (ResNet-50) | 94.5 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 44 | SlowFast 4x16 (ResNet-50) | 94.0 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 45 | MoViNet-A2 | 93.4 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 46 | MoViNet-A1 | 92.6 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 47 | LGD-3D Flow | 92.4 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 48 | MoViNet-A0 | 90.4 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 49 | MoViNet-A3 | 80.8 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |