Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Video on Kinetics-600

Metric: Top-5 Accuracy (higher is better)
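The page does not spell out how the metric is computed. Under the conventional definition, a prediction counts as correct when the true class is among the five highest-scoring classes, and the score is the percentage of correct predictions. A minimal sketch under that assumption follows; the function name `top_k_accuracy` and the list-of-lists input format are illustrative choices, not taken from any of the listed papers.

```python
def top_k_accuracy(scores, labels, k=5):
    """Percentage of samples whose true label is among the k highest-scoring classes.

    scores: one list of per-class scores for each sample (assumed input format).
    labels: the true class index for each sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return 100.0 * hits / len(labels)


# Two samples, three classes; the true class is 1 for both.
example_scores = [[0.1, 0.9, 0.0], [0.8, 0.15, 0.05]]
example_labels = [1, 1]
top1 = top_k_accuracy(example_scores, example_labels, k=1)
top2 = top_k_accuracy(example_scores, example_labels, k=2)
```

In the toy data the second sample ranks its true class second, so it counts as a miss at k=1 and a hit at k=2. Reported leaderboard numbers may additionally depend on each paper's evaluation protocol (for example, how many clips or crops are averaged per video), which this sketch does not model.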


Results

| # | Model | Top-5 Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | TubeVit-H | 98.9 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 2 | UMT-L (ViT-L/16) | 98.8 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 3 | TubeVit-L | 98.7 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 4 | MTV-H (WTS 60M) | 98.5 | Yes | Multiview Transformers for Video Recognition | 2022-01-12 | Code |
| 5 | UniFormerV2-L | 98.5 | Yes | - | - | Code |
| 6 | VideoMAE V2-g (64x266x266) | 98.5 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 7 | mPLUG-2 | 98.3 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 8 | VideoMAE V2-g | 98.2 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code |
| 9 | MaskFeat (no extra data, MViT-L) | 98 | No | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code |
| 10 | MViTv2-L (ImageNet-21k pretrain) | 97.9 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 11 | Florence (curated FLD-900M pretrain) | 97.9 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 12 | CoVeR (JFT-3B) | 97.8 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 13 | X-CLIP(ViT-L/14, CLIP) | 97.7 | Yes | Expanding Language-Image Pretrained Models for G... | 2022-08-04 | Code |
| 14 | TubeVit-B | 97.3 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code |
| 15 | CoVeR (JFT-300M) | 97.3 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | - |
| 16 | Swin-L (384x384, ImageNet-21k pretrain) | 97.3 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 17 | MViTv2-B (train from scratch) | 97.2 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 18 | 🍷MerlotReserve-Large (+Audio) | 97.1 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 19 | TokenLearner 16at18 w. Fuser (L/10) | 97 | Yes | TokenLearner: What Can 8 Learned Tokens Do for I... | 2021-06-21 | Code |
| 20 | UniFormer-B (ImageNet-1K) | 96.7 | Yes | - | - | Code |
| 21 | 🍷MerlotReserve-Base (+Audio) | 96.6 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 22 | VATT-Large | 96.6 | Yes | VATT: Transformers for Multimodal Self-Supervise... | 2021-04-22 | Code |
| 23 | ViViT-H/16x2 (JFT) | 96.5 | Yes | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 24 | Swin-B (ImageNet-21k pretrain) | 96.5 | Yes | Video Swin Transformer | 2021-06-24 | Code |
| 25 | MoViNet-A6 | 96.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 26 | MoViNet-A5 (AutoAugment) | 96.4 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 27 | 🍷MerlotReserve-Large (no Audio) | 96.3 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 28 | XViT (x16) | 96.3 | No | Space-time Mixing Attention for Video Transformer | 2021-06-10 | Code |
| 29 | MViT-B-24, 32x3 | 96.3 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 30 | MViT-B, 32x3 | 96.3 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 31 | LGD-3D Two-stream | 96.2 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 32 | 🍷MerlotReserve-Base (no Audio) | 95.8 | Yes | MERLOT Reserve: Neural Script Knowledge through ... | 2022-01-07 | - |
| 33 | ViViT-L/16x2 (320x320) | 95.7 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 34 | MoViNet-A5 | 95.7 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 35 | MViT-B, 16x4 | 95.7 | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 36 | PERF-Net (distilled ResNet50-G) | 95.7 | No | PERF-Net: Pose Empowered RGB-Flow Net | 2020-09-28 | - |
| 37 | ViViT-L/16x2 | 95.6 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code |
| 38 | LGD-3D RGB | 95.6 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 39 | SlowFast 16x8 (ResNet-101 + NL) | 95.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 40 | SlowFast 16x8 (ResNet-101) | 95.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 41 | MoViNet-A4 | 94.9 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 42 | SlowFast 8x8 (ResNet-101) | 94.8 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 43 | SlowFast 8x8 (ResNet-50) | 94.5 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 44 | SlowFast 4x16 (ResNet-50) | 94 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 45 | MoViNet-A2 | 93.4 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 46 | MoViNet-A1 | 92.6 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 47 | LGD-3D Flow | 92.4 | No | Learning Spatio-Temporal Representation with Loc... | 2019-06-13 | - |
| 48 | MoViNet-A0 | 90.4 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |
| 49 | MoViNet-A3 | 80.8 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code |