Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video on Kinetics-400

Metric: Acc@5 (top-5 accuracy, in percent; higher is better)
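For readers unfamiliar with the metric, the following is a minimal sketch of how a top-5 accuracy figure is typically computed, assuming Acc@5 here means the standard definition: a clip counts as correct when its true label is among the five highest-scoring classes. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not taken from the leaderboard or any listed paper, and the sketch ignores evaluation details (such as how many views or crops are averaged per clip) that vary between submissions.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 5,
) -> float:
    """Percentage of samples whose true label is among the k highest scores.

    `scores[i][c]` is the (hypothetical) model score for class `c` on sample `i`;
    `labels[i]` is the index of the true class for sample `i`.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("need at least one sample")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example with 6 classes and 2 clips (made-up scores).
toy_scores = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # class 0 has the lowest score: a top-5 miss
    [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],  # class 0 has the highest score: a top-5 hit
]
toy_labels = [0, 0]
print(top_k_accuracy(toy_scores, toy_labels, k=5))  # 50.0
```

With 400 classes, as in Kinetics-400, the same function applies unchanged; only the length of each score row grows.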

Results

| Rank | Model | Acc@5 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | TubeViT-H (ImageNet-1k) | 98.9 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 2 | Unmasked Teacher (ViT-L) | 98.7 | No | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 3 | UMT-L (ViT-L/16) | 98.7 | No | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 4 | TubeVit-L (ImageNet-1k) | 98.6 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 5 | UniFormerV2-L (ViT-L, 336) | 98.4 | Yes | - | - | Code |
| 6 | VideoMAE V2-g (64x266x266) | 98.4 | No | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 7 | BIKE (CLIP ViT-L/14) | 98.4 | No | Bidirectional Cross-Modal Knowledge Exploration ... | 2022-12-31 | Code |
| 8 | MTV-H (WTS 60M) | 98.3 | No | Multiview Transformers for Video Recognition | 2022-01-12 | Code |
| 9 | ATM | 98.3 | No | What Can Simple Arithmetic Operations Do for Tem... | 2023-07-18 | Code |
| 10 | DejaVid | 98.2 | Yes | - | - | Code |
| 11 | Side4Video (EVA, ViT-E/14) | 98.2 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 12 | VideoMAE V2-g | 98.1 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 13 | ILA (ViT-L/14) | 97.8 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 14 | ONE-PEACE | 97.8 | No | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 15 | EVL (CLIP ViT-L/14@336px, frozen, 32 frames) | 97.8 | No | Frozen CLIP Models are Efficient Video Learners | 2022-08-06 | Code |
| 16 | DualPath w/ ViT-L/14 | 97.8 | No | Dual-path Adaptation from Image to Video Transfo... | 2023-03-17 | Code |
| 17 | AIM (CLIP ViT-L/14, 32x224) | 97.7 | Yes | AIM: Adapting Image Models for Efficient Video A... | 2023-02-06 | Code |
| 18 | mPLUG-2 | 97.7 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 19 | TubeVit-B (ImageNet-1k) | 97.6 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 20 | Text4Vis (CLIP ViT-L/14) | 97.6 | No | Revisiting Classifier: Transferring Vision-Langu... | 2022-07-04 | Code |
| 21 | VideoMAE (no extra data, ViT-H, 32x320x320) | 97.6 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 22 | ST-Adapter (ViT-L, CLIP) | 97.6 | No | ST-Adapter: Parameter-Efficient Image-to-Video T... | 2022-06-27 | Code |
| 23 | ZeroI2V ViT-L/14 | 97.6 | No | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra... | 2023-10-02 | Code |
| 24 | CoVeR (JFT-3B) | 97.5 | No | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 25 | X-CLIP(ViT-L/14, CLIP) | 97.4 | No | Expanding Language-Image Pretrained Models for G... | 2022-08-04 | Code |
| 26 | MVD (K400 pretrain, ViT-H, 16x224x224) | 97.4 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 27 | MaskFeat (K600, MViT-L) | 97.4 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 28 | MaskFeat (no extra data, MViT-L) | 97.3 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 29 | VideoMAE (no extra data, ViT-L, 32x320x320) | 97.3 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 30 | CoVeR (JFT-300M) | 97.2 | No | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 31 | ILA (ViT-B/16) | 97.2 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 32 | VideoMAE (no extra data, ViT-H) | 97.1 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 33 | DualPath w/ ViT-B/16 | 97.1 | No | Dual-path Adaptation from Image to Video Transfo... | 2023-03-17 | Code |
| 34 | ActionCLIP (CLIP-pretrained) | 97.1 | No | ActionCLIP: A New Paradigm for Video Action Reco... | 2021-09-17 | Code |
| 35 | MVD (K400 pretrain, ViT-L, 16x224x224) | 97 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 36 | MViTv2-L (ImageNet-21k pretrain) | 97 | Yes | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 37 | VideoMAE (no extra data, ViT-L, 16x4) | 96.8 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 38 | Swin-L (384x384, ImageNet-21k pretrain) | 96.7 | No | Video Swin Transformer | 2021-06-24 | Code |
| 39 | MAR (50% mask, ViT-L, 16x4) | 96.3 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 40 | OMNIVORE (Swin-B) | 96.2 | No | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code |
| 41 | OMNIVORE (Swin-L) | 96.1 | No | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code |
| 42 | MAR (75% mask, ViT-L, 16x4) | 96 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 43 | Swin-L (ImageNet-21k pretrain) | 95.9 | No | Video Swin Transformer | 2021-06-24 | Code |
| 44 | ViViT-H/16x2 (JFT) | 95.8 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 45 | MVD (K400 pretrain, ViT-B, 16x224x224) | 95.8 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 46 | ILA (ViT-B/32) | 95.8 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 47 | Swin-B (ImageNet-21k pretrain) | 95.5 | No | Video Swin Transformer | 2021-06-24 | Code |
| 48 | VATT-Large | 95.5 | No | VATT: Transformers for Multimodal Self-Supervise... | 2021-04-22 | Code |
| 49 | ip-CSN-152 (IG-65M pretraining) | 95.3 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 50 | AMD(ViT-B/16) | 95.3 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 51 | AdaMAE | 95.2 | No | AdaMAE: Adaptive Masking for Efficient Spatiotem... | 2022-11-16 | Code |
| 52 | LGD-3D Two-stream (ResNet-101) | 95.2 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 53 | Motionformer-HR | 95.2 | No | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code |
| 54 | VideoMAE (no extra data, ViT-B, 16x4) | 95.1 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 55 | R[2+1]D-152 (IG-65M pretraining) | 95.1 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 56 | MViT-B, 64x3 | 95.1 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 57 | MoViNet-A5 | 94.9 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 58 | DirecFormer | 94.86 | No | DirecFormer: A Directed Attention in Transformer... | 2022-03-19 | Code |
| 59 | MVD (K400 pretrain, ViT-S, 16x224x224) | 94.8 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 60 | TimeSformer-L | 94.7 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 61 | ViViT-L/16x2 320 | 94.7 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 62 | MBT (AV) | 94.6 | No | Attention Bottlenecks for Multimodal Fusion | 2021-06-30 | Code |
| 63 | Swin-B (ImageNet-1k pretrain) | 94.6 | No | Video Swin Transformer | 2021-06-24 | Code |
| 64 | En-VidTr-L | 94.6 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 65 | X3D-XXL | 94.6 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 66 | UniFormer-B (ImageNet-1K) | 94.5 | No | - | - | Code |
| 67 | Swin-S (ImageNet-1k pretrain) | 94.5 | No | Video Swin Transformer | 2021-06-24 | Code |
| 68 | MoViNet-A4 | 94.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 69 | AMD(ViT-S/16) | 94.5 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 70 | OmniVL | 94.5 | No | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 71 | MAR (50% mask, ViT-B, 16x4) | 94.4 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 72 | OmniSource SlowOnly R101 8x8(ImageNet pretrain) | 94.4 | No | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 73 | R3D-RS-200 | 94.4 | No | Revisiting 3D ResNets for Video Recognition | 2021-09-03 | Code |
| 74 | OmniSource SlowOnly R101 8x8 (Scratch) | 94.4 | No | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 75 | MViT-B, 32x3 | 94.4 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 76 | TimeSformer-HR | 94.4 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 77 | LGD-3D RGB (ResNet-101) | 94.4 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 78 | TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 94.4 | No | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code |
| 79 | En-VidTr-M | 94.2 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 80 | ViT-B-VTN+ ImageNet-21K (84.0 [10]) | 94.2 | No | Video Transformer Network | 2021-02-01 | Code |
| 81 | En-VidTr-S | 94 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 82 | X3D-XL | 93.9 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 83 | SlowFast 16x8 (ResNet-101 + NL) | 93.9 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 84 | ip-CSN-152 (Sports-1M pretraining) | 93.8 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 85 | MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 93.8 | No | MVFNet: Multi-View Fusion Network for Efficient ... | 2020-12-13 | Code |
| 86 | MoViNet-A3 | 93.8 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 87 | MAR (75% mask, ViT-B, 16x4) | 93.7 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 88 | TAdaConvNeXt-T | 93.7 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 89 | ViT-B-VTN (3 layers, ImageNet pretrain) | 93.7 | No | Video Transformer Network | 2021-02-01 | Code |
| 90 | TimeSformer | 93.7 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 91 | Swin-T (ImageNet-1k pretrain) | 93.6 | No | Video Swin Transformer | 2021-06-24 | Code |
| 92 | SlowFast 16x8 (ResNet-101) | 93.5 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 93 | MViT-B, 16x4 | 93.5 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 94 | TAda2D-En (ResNet-50, 8+16 frames) | 93.5 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 95 | S3D-G (RGB, ImageNet pretrained) | 93.4 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 96 | ViT-B-VTN (1 layer, ImageNet pretrain) | 93.4 | No | Video Transformer Network | 2021-02-01 | Code |
| 97 | I3D + NL | 93.3 | No | Non-local Neural Networks | 2017-11-21 | Code |
| 98 | SlowFast 8x8 (ResNet-101) | 93.2 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 99 | BQN (ResNet-50) | 93.2 | No | Busy-Quiet Video Disentangling for Video Classif... | 2021-03-29 | Code |
| 100 | TAda2D (ResNet-50, 16 frames) | 93.1 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 101 | S3D-G (RGB+Flow, ImageNet pretrained) | 93 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 102 | X3D-L | 92.9 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 103 | ip-CSN-152 | 92.8 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 104 | SlowFast 8x8 (ResNet-50) | 92.6 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 105 | TAda2D (ResNet-50, 8 frames) | 92.6 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 106 | X3D-M | 92.3 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 107 | MoViNet-A2 | 92.3 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 108 | MViT-S | 92.1 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 109 | SlowFast 4x16 (ResNet-50) | 92.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 110 | R[2+1]D-Flow (Sports-1M pretrain) | 91.9 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 111 | A2 Net | 91.5 | No | $A^2$-Nets: Double Attention Networks | 2018-10-27 | - |
| 112 | R[2+1]D-RGB (Sports-1M pretrain) | 91.4 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 113 | bLVNet Fan et al. (2019) | 91.2 | No | More Is Less: Learning Efficient Video Represent... | 2019-12-02 | Code |
| 114 | MoViNet-A1 | 91.2 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 115 | TSN | 91.1 | No | Temporal Segment Networks: Towards Good Practice... | 2016-08-02 | Code |
| 116 | R[2+1]D-Two-Stream | 90.9 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 117 | Inception-ResNet | 90.9 | No | Revisiting the Effectiveness of Off-the-shelf Te... | 2017-08-12 | - |
| 118 | LGD-3D Flow (ResNet-101) | 90.9 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 119 | MFNet | 90.4 | No | Multi-Fiber Networks for Video Recognition | 2018-07-30 | - |
| 120 | ARTNet | 90.4 | No | Appearance-and-Relation Networks for Video Class... | 2017-11-24 | Code |
| 121 | R[2+1]D | 90 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 122 | R[2+1]D-RGB | 90 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 123 | I3D | 89.3 | No | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 124 | S3D-G (Flow, ImageNet pretrained) | 87.6 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 125 | MoViNet-A0 | 87.4 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 126 | R[2+1]D-Flow | 87.2 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |