Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video on Kinetics-600

Metric: Top-1 Accuracy (higher is better)
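
For readers unfamiliar with the metric, top-1 accuracy is the percentage of evaluation clips whose single highest-scoring predicted class matches the ground-truth label. The sketch below is illustrative only: the function name `top1_accuracy` and the toy scores are assumptions made for this example, and it does not reproduce any specific paper's evaluation protocol (for instance, how many clips or crops are averaged per video, which varies between entries).

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of examples whose highest-scoring class equals the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the top-1 prediction for this example.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy data: 4 clips, 3 classes. Values are made up for illustration.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
toy_labels = [1, 0, 1, 1]

print(top1_accuracy(toy_scores, toy_labels))  # 3 of 4 correct -> 75.0
```

A leaderboard value such as 91.9 corresponds to this percentage computed over the benchmark's evaluation set.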

Results

| # | Model | Top-1 Accuracy (%) | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | InternVideo2-6B | 91.9 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 2 | TubeVit-H | 91.8 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 3 | InternVideo2-1B | 91.6 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 4 | TubeVit-L | 91.5 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 5 | InternVideo-T | 91.3 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 6 | 🍷MerlotReserve-Large (+Audio) | 91.1 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 7 | TubeVit-B | 90.9 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 8 | UMT-L (ViT-L/16) | 90.5 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 9 | MTV-H (WTS 60M) | 90.3 | Yes | Multiview Transformers for Video Recognition | 2022-01-12 | Code |
| 10 | UniFormerV2-L | 90.1 | Yes | - | - | Code |
| 11 | VideoMAE V2-g (64x266x266) | 89.9 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 12 | mPLUG-2 | 89.8 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 13 | 🍷MerlotReserve-Base (+Audio) | 89.7 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 14 | 🍷MerlotReserve-Large (no Audio) | 89.4 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 15 | CoCa (finetuned) | 89.4 | Yes | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 16 | VideoMAE V2-g | 88.8 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 17 | Hiera-H (no extra data) | 88.8 | No | Hiera: A Hierarchical Vision Transformer without... | 2023-06-01 | Code |
| 18 | CoCa (frozen) | 88.5 | Yes | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 19 | MaskFeat (no extra data, MViT-L) | 88.3 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 20 | X-CLIP(ViT-L/14, CLIP) | 88.3 | Yes | Expanding Language-Image Pretrained Models for G... | 2022-08-04 | Code |
| 21 | 🍷MerlotReserve-Base (no Audio) | 88.1 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 22 | MViTv2-L (ImageNet-21k pretrain) | 87.9 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 23 | CoVeR (JFT-3B) | 87.9 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 24 | Florence (curated FLD-900M pretrain) | 87.8 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 25 | CoVeR (JFT-300M) | 86.8 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 26 | TokenLearner 16at18 w. Fuser (L/10) | 86.3 | Yes | TokenLearner: What Can 8 Learned Tokens Do for I... | 2021-06-21 | Code |
| 27 | Swin-L (384x384, ImageNet-21k pretrain) | 86.1 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 28 | ViViT-H/16x2 (JFT) | 85.8 | Yes | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 29 | MViTv2-L (train from scratch) | 85.5 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 30 | UniFormer-B (ImageNet-1K) | 84.8 | Yes | - | - | Code |
| 31 | XViT (x16) | 84.5 | No | Space-time Mixing Attention for Video Transformer | 2021-06-10 | Code |
| 32 | MoViNet-A5 (AutoAugment) | 84.3 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 33 | ViViT-L/16x2 | 84.3 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 34 | Swin-B (ImageNet-21k pretrain) | 84 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 35 | MViT-B-24, 32x3 | 83.8 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 36 | VATT-Large | 83.6 | Yes | VATT: Transformers for Multimodal Self-Supervise... | 2021-04-22 | Code |
| 37 | MoViNet-A6 | 83.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 38 | MViT-B, 32x3 | 83.4 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 39 | LGD-3D Two-stream | 83.1 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 40 | R3D-RS-200 | 83.1 | No | Revisiting 3D ResNets for Video Recognition | 2021-09-03 | Code |
| 41 | ViViT-L/16x2 (320x320) | 83 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 42 | MoViNet-A5 | 82.7 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 43 | MViT-B, 16x4 | 82.1 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 44 | PERF-Net (distilled ResNet50-G) | 82 | No | PERF-Net: Pose Empowered RGB-Flow Net | 2020-09-28 | - |
| 45 | SlowFast 16x8 (ResNet-101 + NL) | 81.8 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 46 | LGD-3D RGB | 81.5 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 47 | MoViNet-A4 | 81.2 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 48 | SlowFast 16x8 (ResNet-101) | 81.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 49 | MoViNet-A3 | 80.8 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 50 | SlowFast 8x8 (ResNet-101) | 80.4 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 51 | SlowFast 8x8 (ResNet-50) | 79.9 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 52 | D3D+S3D-G | 79.1 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 53 | SlowFast 4x16 (ResNet-50) | 78.8 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 54 | S3D-G (RGB+Flow) | 78.6 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 55 | D3D | 77.9 | No | D3D: Distilled 3D Networks for Video Action Reco... | 2018-12-19 | Code |
| 56 | MoViNet-A2 | 77.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 57 | S3D-G (RGB) | 76.6 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |
| 58 | MoViNet-A1 | 76 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 59 | LGD-3D Flow | 75 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 60 | I3D (RGB) | 73.6 | No | A Short Note about Kinetics-600 | 2018-08-03 | Code |
| 61 | MoViNet-A0 | 71.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 62 | S3D-G (Flow) | 69.7 | No | Rethinking Spatiotemporal Feature Learning: Spee... | 2017-12-13 | Code |