| 1 | InternVideo2-6B | 91.9 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 2 | TubeViT-H | 91.8 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 3 | InternVideo2-1B | 91.6 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 4 | TubeViT-L | 91.5 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 5 | InternVideo-T | 91.3 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 6 | 🍷MerlotReserve-Large (+Audio) | 91.1 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 7 | TubeViT-B | 90.9 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 8 | UMT-L (ViT-L/16) | 90.5 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 9 | MTV-H (WTS 60M) | 90.3 | Yes | Multiview Transformers for Video Recognition | 2022-01-12 | Code |
| 10 | UniFormerV2-L | 90.1 | Yes | - | - | Code |
| 11 | VideoMAE V2-g (64x266x266) | 89.9 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 12 | mPLUG-2 | 89.8 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 13 | 🍷MerlotReserve-Base (+Audio) | 89.7 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 14 | 🍷MerlotReserve-Large (no Audio) | 89.4 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 15 | CoCa (finetuned) | 89.4 | Yes | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 16 | VideoMAE V2-g | 88.8 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 17 | Hiera-H (no extra data) | 88.8 | No | Hiera: A Hierarchical Vision Transformer without... | 2023-06-01 | Code |
| 18 | CoCa (frozen) | 88.5 | Yes | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 19 | MaskFeat (no extra data, MViT-L) | 88.3 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 20 | X-CLIP (ViT-L/14, CLIP) | 88.3 | Yes | Expanding Language-Image Pretrained Models for G... | 2022-08-04 | Code |
| 21 | 🍷MerlotReserve-Base (no Audio) | 88.1 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 22 | MViTv2-L (ImageNet-21k pretrain) | 87.9 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 23 | CoVeR (JFT-3B) | 87.9 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 24 | Florence (curated FLD-900M pretrain) | 87.8 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 25 | CoVeR (JFT-300M) | 86.8 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 26 | TokenLearner 16at18 w. Fuser (L/10) | 86.3 | Yes | TokenLearner: What Can 8 Learned Tokens Do for I... | 2021-06-21 | Code |
| 27 | Swin-L (384x384, ImageNet-21k pretrain) | 86.1 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 28 | ViViT-H/16x2 (JFT) | 85.8 | Yes | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 29 | MViTv2-L (train from scratch) | 85.5 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 30 | UniFormer-B (ImageNet-1K) | 84.8 | Yes | - | - | Code |
| 31 | XViT (x16) | 84.5 | No | Space-time Mixing Attention for Video Transformer | 2021-06-10 | Code |
| 32 | MoViNet-A5 (AutoAugment) | 84.3 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 33 | ViViT-L/16x2 | 84.3 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 34 | Swin-B (ImageNet-21k pretrain) | 84.0 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 35 | MViT-B-24, 32x3 | 83.8 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 36 | VATT-Large | 83.6 | Yes | VATT: Transformers for Multimodal Self-Supervise... | 2021-04-22 | Code |
| 37 | MoViNet-A6 | 83.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 38 | MViT-B, 32x3 | 83.4 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 39 | LGD-3D Two-stream | 83.1 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 40 | R3D-RS-200 | 83.1 | No | Revisiting 3D ResNets for Video Recognition | 2021-09-03 | Code |
| 41 | ViViT-L/16x2 (320x320) | 83.0 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 42 | MoViNet-A5 | 82.7 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 43 | MViT-B, 16x4 | 82.1 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 44 | PERF-Net (distilled ResNet50-G) | 82.0 | No | PERF-Net: Pose Empowered RGB-Flow Net | 2020-09-28 | - |
| 45 | SlowFast 16x8 (ResNet-101 + NL) | 81.8 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 46 | LGD-3D RGB | 81.5 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 47 | MoViNet-A4 | 81.2 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 48 | SlowFast 16x8 (ResNet-101) | 81.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 49 | MoViNet-A3 | 80.8 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 50 | SlowFast 8x8 (ResNet-101) | 80.4 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 51 | SlowFast 8x8 (ResNet-50) | 79.9 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 52 | D3D+S3D-G | 79.1 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 53 | SlowFast 4x16 (ResNet-50) | 78.8 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 54 | S3D-G (RGB+Flow) | 78.6 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 55 | D3D | 77.9 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 56 | MoViNet-A2 | 77.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 57 | S3D-G (RGB) | 76.6 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 58 | MoViNet-A1 | 76.0 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 59 | LGD-3D Flow | 75.0 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 60 | I3D (RGB) | 73.6 | No | A Short Note about Kinetics-600 | 2018-08-03 | Code |
| 61 | MoViNet-A0 | 71.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 62 | S3D-G (Flow) | 69.7 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |