Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video on Kinetics-400

Metric: Acc@1 (higher is better)
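
As a rough illustration of the metric, Acc@1 (top-1 accuracy) is the percentage of test samples whose highest-scoring predicted class matches the ground-truth label. The sketch below is a minimal version under that assumption; the function name `top1_accuracy` and the toy scores are hypothetical, and this page does not specify how individual entries aggregate clips or crops before scoring.

```python
def top1_accuracy(scores, labels):
    """Return top-1 accuracy in percent.

    scores: list of per-class score lists, one per sample (hypothetical input layout).
    labels: list of integer ground-truth class indices, one per sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: four samples, two classes; three predictions match their labels.
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 0]
print(top1_accuracy(toy_scores, toy_labels))  # 75.0
```

Higher values are better, which matches the descending sort of the results listed on this page.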

Results

| # | Model | Acc@1 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | OmniVec2 | 93.6 | No | - | - | - |
| 2 | FTP-UniFormerV2-L/14 | 93.4 | No | Enhancing Video Transformers for Action Understa... | 2024-03-24 | - |
| 3 | InternVideo2-6B | 92.1 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 4 | InternVideo2-1B | 91.6 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 5 | InternVideo | 91.1 | No | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 6 | OmniVec | 91.1 | No | OmniVec: Learning robust representations with cr... | 2023-11-07 | - |
| 7 | TubeViT-H (ImageNet-1k) | 90.9 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 8 | Unmasked Teacher (ViT-L) | 90.6 | No | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 9 | UMT-L (ViT-L/16) | 90.6 | No | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 10 | TubeVit-L (ImageNet-1k) | 90.2 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 11 | UniFormerV2-L (ViT-L, 336) | 90 | Yes | - | - | Code |
| 12 | VideoMAE V2-g (64x266x266) | 90 | No | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 13 | FluxViT-B | 90 | Yes | Make Your Training Flexible: Towards Deployment-... | 2025-03-18 | Code |
| 14 | MTV-H (WTS 60M) | 89.9 | No | Multiview Transformers for Video Recognition | 2022-01-12 | Code |
| 15 | TAdaFormer-L/14 | 89.9 | No | Temporally-Adaptive Models for Efficient Video U... | 2023-08-10 | Code |
| 16 | EVA | 89.7 | No | EVA: Exploring the Limits of Masked Visual Repre... | 2022-11-14 | Code |
| 17 | AM/12 ViT-B Dinov2 | 89.6 | No | AM Flow: Adapters for Temporal Processing in Act... | 2024-11-04 | - |
| 18 | ATM | 89.4 | No | What Can Simple Arithmetic Operations Do for Tem... | 2023-07-18 | Code |
| 19 | DejaVid | 89.1 | Yes | - | - | Code |
| 20 | CoCa (finetuned) | 88.9 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 21 | BIKE (CLIP ViT-L/14) | 88.7 | No | Bidirectional Cross-Modal Knowledge Exploration ... | 2022-12-31 | Code |
| 22 | ILA (ViT-L/14) | 88.7 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 23 | Side4Video (EVA, ViT-E/14) | 88.6 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 24 | TubeVit-B (ImageNet-1k) | 88.6 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 25 | VideoMAE V2-g | 88.5 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 26 | ONE-PEACE | 88.1 | No | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 27 | FluxViT-S | 88 | Yes | Make Your Training Flexible: Towards Deployment-... | 2025-03-18 | Code |
| 28 | CoCa (frozen) | 88 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 29 | ViT-22B | 88 | No | Scaling Vision Transformers to 22 Billion Parame... | 2023-02-10 | Code |
| 30 | Text4Vis (CLIP ViT-L/14) | 87.8 | No | Revisiting Classifier: Transferring Vision-Langu... | 2022-07-04 | Code |
| 31 | Hiera-H (no extra data) | 87.8 | No | Hiera: A Hierarchical Vision Transformer without... | 2023-06-01 | Code |
| 32 | EVL (CLIP ViT-L/14@336px, frozen, 32 frames) | 87.7 | No | Frozen CLIP Models are Efficient Video Learners | 2022-08-06 | Code |
| 33 | DualPath w/ ViT-L/14 | 87.7 | No | Dual-path Adaptation from Image to Video Transfo... | 2023-03-17 | Code |
| 34 | X-CLIP(ViT-L/14, CLIP) | 87.7 | No | Expanding Language-Image Pretrained Models for G... | 2022-08-04 | Code |
| 35 | AIM (CLIP ViT-L/14, 32x224) | 87.5 | Yes | AIM: Adapting Image Models for Efficient Video A... | 2023-02-06 | Code |
| 36 | VideoMAE (no extra data, ViT-H, 32x320x320) | 87.4 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 37 | ST-Adapter (ViT-L, CLIP) | 87.2 | No | ST-Adapter: Parameter-Efficient Image-to-Video T... | 2022-06-27 | Code |
| 38 | ZeroI2V ViT-L/14 | 87.2 | No | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra... | 2023-10-02 | Code |
| 39 | CoVeR (JFT-3B) | 87.2 | No | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 40 | MVD (K400 pretrain, ViT-H, 16x224x224) | 87.2 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 41 | mPLUG-2 | 87.1 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 42 | MaskFeat (K600, MViT-L) | 87 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 43 | VicTR (ViT-L/14) | 87 | No | VicTR: Video-conditioned Text Representations fo... | 2023-04-05 | - |
| 44 | Video-SwinV2-G (ImageNet-22k and external 70M pretrain) | 86.8 | No | Swin Transformer V2: Scaling Up Capacity and Res... | 2021-11-18 | Code |
| 45 | MaskFeat (no extra data, MViT-L) | 86.7 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 46 | VideoMAE (no extra data, ViT-H) | 86.6 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 47 | MVD (K400 pretrain, ViT-L, 16x224x224) | 86.4 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 48 | TAdaConvNeXtV2-B | 86.4 | No | Temporally-Adaptive Models for Efficient Video U... | 2023-08-10 | Code |
| 49 | CoVeR (JFT-300M) | 86.3 | No | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 50 | VideoMAE (no extra data, ViT-L, 32x320x320) | 86.1 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 51 | MViTv2-L (ImageNet-21k pretrain) | 86.1 | Yes | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 52 | ILA (ViT-B/16) | 85.7 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 53 | DualPath w/ ViT-B/16 | 85.4 | No | Dual-path Adaptation from Image to Video Transfo... | 2023-03-17 | Code |
| 54 | TokenLearner 16at18 (L/10) | 85.4 | No | TokenLearner: What Can 8 Learned Tokens Do for I... | 2021-06-21 | Code |
| 55 | MAR (50% mask, ViT-L, 16x4) | 85.3 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 56 | CAST(ViT-B/16) | 85.3 | No | CAST: Cross-Attention in Space and Time for Vide... | 2023-11-30 | Code |
| 57 | VideoMAE (no extra data, ViT-L, 16x4) | 85.2 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 58 | ViC-MAE (ViT-L) | 85.1 | No | ViC-MAE: Self-Supervised Representation Learning... | 2023-03-21 | Code |
| 59 | VideoMamba-M800 | 85 | No | VideoMamba: State Space Model for Efficient Vide... | 2024-03-11 | Code |
| 60 | Swin-L (384x384, ImageNet-21k pretrain) | 84.9 | No | Video Swin Transformer | 2021-06-24 | Code |
| 61 | ViViT-H/16x2 (JFT) | 84.9 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 62 | OMNIVORE (Swin-L) | 84.1 | No | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code |
| 63 | OMNIVORE (Swin-B) | 84 | No | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code |
| 64 | MAR (75% mask, ViT-L, 16x4) | 83.9 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 65 | ActionCLIP (CLIP-pretrained) | 83.8 | No | ActionCLIP: A New Paradigm for Video Action Reco... | 2021-09-17 | Code |
| 66 | OmniSource irCSN-152 (IG-Kinetics-65M pretrain) | 83.6 | No | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 67 | MVD (K400 pretrain, ViT-B, 16x224x224) | 83.4 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 68 | StructViT-B-4-1 | 83.4 | No | Learning Correlation Structures for Vision Trans... | 2024-04-05 | - |
| 69 | Swin-L (ImageNet-21k pretrain) | 83.1 | No | Video Swin Transformer | 2021-06-24 | Code |
| 70 | SIFA | 83.1 | No | Stand-Alone Inter-Frame Attention in Video Models | 2022-06-14 | Code |
| 71 | UniFormer-B (ImageNet-1K) | 82.9 | No | - | - | Code |
| 72 | irCSN-152 (IG-Kinetics-65M pretrain) | 82.8 | No | Large-scale weakly-supervised pre-training for v... | 2019-05-02 | Code |
| 73 | DirecFormer | 82.75 | No | DirecFormer: A Directed Attention in Transformer... | 2022-03-19 | Code |
| 74 | Swin-B (ImageNet-21k pretrain) | 82.7 | No | Video Swin Transformer | 2021-06-24 | Code |
| 75 | ir-CSN-152 (IG-65M pretraining) | 82.6 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 76 | ip-CSN-152 (IG-65M pretraining) | 82.5 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 77 | TPS | 82.5 | No | Spatiotemporal Self-attention Modeling with Temp... | 2022-07-27 | Code |
| 78 | ILA (ViT-B/32) | 82.4 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code |
| 79 | AMD(ViT-B/16) | 82.2 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 80 | VATT-Large | 82.1 | No | VATT: Transformers for Multimodal Self-Supervise... | 2021-04-22 | Code |
| 81 | AdaMAE | 81.7 | No | AdaMAE: Adaptive Masking for Efficient Spatiotem... | 2022-11-16 | Code |
| 82 | VideoMAE (no extra data, ViT-B, 16x4) | 81.5 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code |
| 83 | MoViNet-A6 | 81.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 84 | MLP-3D | 81.4 | No | MLP-3D: A MLP-like 3D Architecture with Grouped ... | 2022-06-13 | - |
| 85 | R[2+1]D-152 (IG-65M pretraining) | 81.3 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 86 | LGD-3D Two-stream (ResNet-101) | 81.2 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 87 | MViT-B, 64x3 | 81.2 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 88 | Motionformer-HR | 81.1 | No | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code |
| 89 | MVD (K400 pretrain, ViT-S, 16x224x224) | 81 | No | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code |
| 90 | MAR (50% mask, ViT-B, 16x4) | 81 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 91 | MoViNet-A5 | 80.9 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 92 | MBT (AV) | 80.8 | No | Attention Bottlenecks for Multimodal Fusion | 2021-06-30 | Code |
| 93 | TimeSformer-L | 80.7 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 94 | Swin-B (ImageNet-1k pretrain) | 80.6 | No | Video Swin Transformer | 2021-06-24 | Code |
| 95 | Swin-S (ImageNet-1k pretrain) | 80.6 | No | Video Swin Transformer | 2021-06-24 | Code |
| 96 | En-VidTr-L | 80.5 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 97 | MoViNet-A4 | 80.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 98 | OmniSource SlowOnly R101 8x8(ImageNet pretrain) | 80.5 | No | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 99 | STAM (64 Frames) | 80.5 | No | An Image is Worth 16x16 Words, What is a Video W... | 2021-03-25 | Code |
| 100 | X3D-XXL | 80.4 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 101 | R3D-RS-200 | 80.4 | No | Revisiting 3D ResNets for Video Recognition | 2021-09-03 | Code |
| 102 | OmniSource SlowOnly R101 8x8 (Scratch) | 80.4 | No | Omni-sourced Webly-supervised Learning for Video... | 2020-03-29 | Code |
| 103 | MViT-B, 32x3 | 80.2 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 104 | AMD(ViT-S/16) | 80.1 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | - |
| 105 | SlowFast 16x8 (ResNet-101 + NL) | 79.8 | Yes | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 106 | CT-Net Ensemble | 79.8 | No | CT-Net: Channel Tensorization Network for Video ... | 2021-06-03 | Code |
| 107 | ViT-B-VTN+ ImageNet-21K (84.0 [10]) | 79.8 | No | Video Transformer Network | 2021-02-01 | Code |
| 108 | TimeSformer-HR | 79.7 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 109 | En-VidTr-M | 79.7 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 110 | LGD-3D RGB (ResNet-101) | 79.4 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 111 | TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 79.4 | No | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code |
| 112 | En-VidTr-S | 79.4 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | - |
| 113 | MAR (75% mask, ViT-B, 16x4) | 79.4 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code |
| 114 | STAM (16 Frames) | 79.3 | No | An Image is Worth 16x16 Words, What is a Video W... | 2021-03-25 | Code |
| 115 | ip-CSN-152 (Sports-1M pretraining) | 79.2 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 116 | CorrNet | 79.2 | No | Video Modeling with Correlation Networks | 2019-06-07 | - |
| 117 | OmniVL | 79.1 | No | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 118 | X3D-XL | 79.1 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 119 | MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 79.1 | No | MVFNet: Multi-View Fusion Network for Efficient ... | 2020-12-13 | Code |
| 120 | TAdaConvNeXt-T | 79.1 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 121 | SlowFast 16x8 (ResNet-101) | 78.9 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 122 | G-Blend (Sports-1M pretrain) | 78.9 | No | What Makes Training Multi-Modal Classification N... | 2019-05-29 | Code |
| 123 | Swin-T (ImageNet-1k pretrain) | 78.8 | No | Video Swin Transformer | 2021-06-24 | Code |
| 124 | GB + DF + LB (ResNet 152, ImageNet pretrained) | 78.8 | No | Action recognition with spatial-temporal discrim... | 2019-08-20 | - |
| 125 | ViT-B-VTN (3 layers, ImageNet pretrain) | 78.6 | No | Video Transformer Network | 2021-02-01 | Code |
| 126 | MViT-B, 16x4 | 78.4 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 127 | MoViNet-A3 | 78.2 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 128 | TAda2D-En (ResNet-50, 8+16 frames) | 78.2 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 129 | SVT | 78.1 | No | Self-supervised Video Transformer | 2021-12-02 | Code |
| 130 | TimeSformer | 78 | No | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code |
| 131 | SlowFast 8x8 (ResNet-101) | 77.9 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 132 | RepFlow-50 ([2+1]D CNN, FcF, Non-local block) | 77.9 | No | Representation Flow for Action Recognition | 2018-10-02 | Code |
| 133 | ip-CSN-152 | 77.8 | No | Video Classification with Channel-Separated Conv... | 2019-04-04 | Code |
| 134 | I3D + NL | 77.7 | No | Non-local Neural Networks | 2017-11-21 | Code |
| 135 | G-Blend | 77.7 | No | What Makes Training Multi-Modal Classification N... | 2019-05-29 | Code |
| 136 | HATNet (32 frames) | 77.6 | No | Large Scale Holistic Video Understanding | 2019-04-25 | Code |
| 137 | X3D-L | 77.5 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 138 | CoST ResNet-101 (ImageNet pretrain) | 77.5 | No | - | - | Code |
| 139 | TAda2D (ResNet-50, 16 frames) | 77.4 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 140 | EvaNet | 77.4 | No | Evolving Space-Time Neural Architectures for Vid... | 2018-11-26 | - |
| 141 | RNL+TSM Ensemble(ResNet50, 8 + 16 frames) | 77.4 | No | Region-based Non-local Operation for Video Class... | 2020-07-17 | Code |
| 142 | VIMPAC | 77.4 | No | VIMPAC: Video Pre-Training via Masked Token Pred... | 2021-06-21 | Code |
| 143 | BQN (ResNet-50) | 77.3 | No | Busy-Quiet Video Disentangling for Video Classif... | 2021-03-29 | Code |
| 144 | S3D-G (RGB+Flow, ImageNet pretrained) | 77.2 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 145 | SlowFast 8x8 (ResNet-50) | 77 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 146 | TAda2D (ResNet-50, 8 frames) | 76.7 | No | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code |
| 147 | D3D+S3D-G (RGB + RGB) | 76.5 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 148 | MSNet-R50 (16 frames, ImageNet pretrained) | 76.4 | No | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code |
| 149 | GloRe | 76.1 | No | Global Textual Relation Embedding for Relational... | 2019-06-03 | Code |
| 150 | X3D-M | 76 | No | X3D: Expanding Architectures for Efficient Video... | 2020-04-09 | Code |
| 151 | MViT-S | 76 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 152 | CMA iter1 (16 frames) | 75.98 | No | Two-Stream Video Classification with Cross-Modal... | 2019-08-01 | - |
| 153 | D3D (RGB) | 75.9 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 154 | Oct-I3D + NL | 75.7 | No | Drop an Octave: Reducing Spatial Redundancy in C... | 2019-04-10 | Code |
| 155 | SlowFast 4x16 (ResNet-50) | 75.6 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 156 | R[2+1]D-Flow (Sports-1M pretrain) | 75.4 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 157 | FASTER32 | 75.1 | No | FASTER Recurrent Networks for Efficient Video Cl... | 2019-06-10 | - |
| 158 | MoViNet-A2 | 75 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 159 | MARS+RGB+Flow (64 frames) | 74.9 | No | - | - | Code |
| 160 | S3D-G (RGB, ImageNet pretrained) | 74.7 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 161 | TSM | 74.7 | No | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code |
| 162 | A2 Net | 74.6 | No | $A^2$-Nets: Double Attention Networks | 2018-10-27 | - |
| 163 | R[2+1]D-RGB (Sports-1M pretrain) | 74.3 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 164 | TSN | 73.9 | No | ConvNet Architecture Search for Spatiotemporal F... | 2017-08-16 | Code |
| 165 | R[2+1]D-Two-Stream | 73.9 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 166 | TSN | 73.9 | No | ConvNet Architecture Search for Spatiotemporal F... | 2017-08-16 | Code |
| 167 | STM (ResNet-50) | 73.7 | No | STM: SpatioTemporal and Motion Encoding for Acti... | 2019-08-07 | - |
| 168 | bLVNet Fan et al. (2019) | 73.5 | No | More Is Less: Learning Efficient Video Represent... | 2019-12-02 | Code |
| 169 | Co Slow_64 | 73.05 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 170 | Inception-ResNet | 73 | No | Revisiting the Effectiveness of Off-the-shelf Te... | 2017-08-12 | - |
| 171 | MFNet | 72.8 | No | Multi-Fiber Networks for Video Recognition | 2018-07-30 | - |
| 172 | MoViNet-A1 | 72.7 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 173 | ARTNet | 72.4 | No | Appearance-and-Relation Networks for Video Class... | 2017-11-24 | Code |
| 174 | LGD-3D Flow (ResNet-101) | 72.3 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 175 | R[2+1]D | 72 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 176 | R[2+1]D-RGB | 72 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 177 | FASTER16 w/o sp | 71.7 | No | FASTER Recurrent Networks for Efficient Video Cl... | 2019-06-10 | - |
| 178 | Co X3D-L_64 | 71.61 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 179 | I3D | 71.1 | No | Quo Vadis, Action Recognition? A New Model and t... | 2017-05-22 | Code |
| 180 | Co X3D-M_64 | 71.03 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 181 | X3D-L | 69.29 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 182 | MARS+RGB+Flow (16 frames) | 68.9 | No | - | - | Code |
| 183 | SlowFast-8×8-R50 | 68.45 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 184 | S3D-G (Flow, ImageNet pretrained) | 68 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 185 | R[2+1]D-Flow | 67.5 | No | A Closer Look at Spatiotemporal Convolutions for... | 2017-11-30 | Code |
| 186 | Slow-8x8-R50 | 67.42 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 187 | Co X3D-S_64 | 67.33 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 188 | X3D-M | 67.24 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 189 | SlowFast-4×16-R50 | 67.06 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 190 | Co Slow_8 | 65.9 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 191 | MoViNet-A0 | 65.8 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 192 | X3D-S | 64.71 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 193 | I3D-R50 | 63.98 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 194 | Co X3D-L_16 | 63.03 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 195 | Co X3D-M_16 | 62.8 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 196 | Co X3D-S_13 | 60.18 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 197 | Co I3D_8 | 59.58 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 198 | R(2+1)D-18_16 | 59.52 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 199 | X3D-XS | 59.37 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 200 | Co I3D_64 | 56.86 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 201 | R(2+1)D-18_8 | 53.52 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |
| 202 | RCU_8 | 53.4 | No | Continual 3D Convolutional Neural Networks for R... | 2021-05-31 | Code |