Video on Kinetics-600
Metric: Top-5 Accuracy (higher is better)
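A prediction counts as correct under top-5 accuracy when the true class is among the five highest-scoring classes for that video. The leaderboard does not describe each entry's evaluation protocol (for example, how many clips or crops are averaged per video), so the following is only a minimal sketch that assumes one already-aggregated score vector per video. The function name `top_k_accuracy` and the toy data are hypothetical and chosen for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 5,
) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered by descending score; ties keep the lower index first.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example: 4 videos, 6 classes (Kinetics-600 itself has 600 classes).
scores = [
    [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],  # true class 4 has the lowest score: miss
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],  # true class 0 is ranked first: hit
    [0.00, 0.10, 0.20, 0.30, 0.25, 0.15],  # true class 0 has the lowest score: miss
    [0.30, 0.25, 0.20, 0.15, 0.06, 0.04],  # true class 4 is ranked fifth: hit
]
labels = [4, 0, 0, 4]

print(top_k_accuracy(scores, labels, k=5))  # 50.0
print(top_k_accuracy(scores, labels, k=1))  # 25.0
```

Top-5 accuracy is never lower than top-1 accuracy on the same predictions, which is one reason the values on this leaderboard cluster close to 100.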
Results
| # | Model | Top-5 Accuracy | Extra Data | Date | Paper | Code | SOTA |
|---|---|---|---|---|---|---|---|
| 1 | TubeVit-H | 98.9 | Yes | 2022-12-06 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | Code | SOTA |
| 2 | UMT-L (ViT-L/16) | 98.8 | Yes | 2023-03-28 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Code | - |
| 3 | TubeVit-L | 98.7 | Yes | 2022-12-06 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | Code | - |
| 4 | MTV-H (WTS 60M) | 98.5 | Yes | 2022-01-12 | Multiview Transformers for Video Recognition | Code | SOTA |
| 5 | UniFormerV2-L | 98.5 | Yes | - | No paper | Code | - |
| 6 | VideoMAE V2-g (64x266x266) | 98.5 | Yes | 2023-03-29 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | Code | - |
| 7 | mPLUG-2 | 98.3 | Yes | 2023-02-01 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | - |
| 8 | VideoMAE V2-g | 98.2 | Yes | 2023-03-29 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | Code | - |
| 9 | MaskFeat (no extra data, MViT-L) | 98 | No | 2021-12-16 | Masked Feature Prediction for Self-Supervised Visual Pre-Training | Code | SOTA |
| 10 | MViTv2-L (ImageNet-21k pretrain) | 97.9 | No | 2021-12-02 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | Code | - |
| 11 | Florence (curated FLD-900M pretrain) | 97.9 | Yes | 2021-11-22 | Florence: A New Foundation Model for Computer Vision | Code | SOTA |
| 12 | CoVeR (JFT-3B) | 97.8 | Yes | 2021-12-14 | Co-training Transformer with Videos and Images Improves Action Recognition | - | - |
| 13 | X-CLIP(ViT-L/14, CLIP) | 97.7 | Yes | 2022-08-04 | Expanding Language-Image Pretrained Models for General Video Recognition | Code | - |
| 14 | TubeVit-B | 97.3 | Yes | 2022-12-06 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | Code | - |
| 15 | CoVeR (JFT-300M) | 97.3 | Yes | 2021-12-14 | Co-training Transformer with Videos and Images Improves Action Recognition | - | - |
| 16 | Swin-L (384x384, ImageNet-21k pretrain) | 97.3 | Yes | 2021-06-24 | Video Swin Transformer | Code | SOTA |
| 17 | MViTv2-B (train from scratch) | 97.2 | No | 2021-12-02 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | Code | - |
| 18 | 🍷MerlotReserve-Large (+Audio) | 97.1 | Yes | 2022-01-07 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | - | - |
| 19 | TokenLearner 16at18 w. Fuser (L/10) | 97 | Yes | 2021-06-21 | TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | Code | SOTA |
| 20 | UniFormer-B (ImageNet-1K) | 96.7 | Yes | - | No paper | Code | - |
| 21 | 🍷MerlotReserve-Base (+Audio) | 96.6 | Yes | 2022-01-07 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | - | - |
| 22 | VATT-Large | 96.6 | Yes | 2021-04-22 | VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Code | SOTA |
| 23 | ViViT-H/16x2 (JFT) | 96.5 | Yes | 2021-03-29 | ViViT: A Video Vision Transformer | Code | - |
| 24 | Swin-B (ImageNet-21k pretrain) | 96.5 | Yes | 2021-06-24 | Video Swin Transformer | Code | - |
| 25 | MoViNet-A6 | 96.5 | No | 2021-03-21 | MoViNets: Mobile Video Networks for Efficient Video Recognition | Code | SOTA |
| 26 | MoViNet-A5 (AutoAugment) | 96.4 | No | 2021-03-21 | MoViNets: Mobile Video Networks for Efficient Video Recognition | Code | - |
| 27 | 🍷MerlotReserve-Large (no Audio) | 96.3 | Yes | 2022-01-07 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | - | - |
| 28 | XViT (x16) | 96.3 | No | 2021-06-10 | Space-time Mixing Attention for Video Transformer | Code | - |
| 29 | MViT-B-24, 32x3 | 96.3 | No | 2021-04-22 | Multiscale Vision Transformers | Code | - |
| 30 | MViT-B, 32x3 | 96.3 | No | 2021-04-22 | Multiscale Vision Transformers | Code | - |
| 31 | LGD-3D Two-stream | 96.2 | No | 2019-06-13 | Learning Spatio-Temporal Representation with Local and Global Diffusion | - | SOTA |
| 32 | 🍷MerlotReserve-Base (no Audio) | 95.8 | Yes | 2022-01-07 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | - | - |
| 33 | ViViT-L/16x2 (320x320) | 95.7 | No | 2021-03-29 | ViViT: A Video Vision Transformer | Code | - |
| 34 | MoViNet-A5 | 95.7 | No | 2021-03-21 | MoViNets: Mobile Video Networks for Efficient Video Recognition | Code | - |
| 35 | MViT-B, 16x4 | 95.7 | No | 2021-04-22 | Multiscale Vision Transformers | Code | - |
| 36 | PERF-Net (distilled ResNet50-G) | 95.7 | No | 2020-09-28 | PERF-Net: Pose Empowered RGB-Flow Net | - | - |
| 37 | ViViT-L/16x2 | 95.6 | No | 2021-03-29 | ViViT: A Video Vision Transformer | Code | - |
| 38 | LGD-3D RGB | 95.6 | No | 2019-06-13 | Learning Spatio-Temporal Representation with Local and Global Diffusion | - | - |
| 39 | SlowFast 16x8 (ResNet-101 + NL) | 95.1 | No | 2018-12-10 | SlowFast Networks for Video Recognition | Code | SOTA |
| 40 | SlowFast 16x8 (ResNet-101) | 95.1 | No | 2018-12-10 | SlowFast Networks for Video Recognition | Code | - |
| 41 | MoViNet-A4 | 94.9 | No | 2021-03-21 | MoViNets: Mobile Video Networks for Efficient Video Recognition | Code | - |
| 42 | SlowFast 8x8 (ResNet-101) | 94.8 | No | 2018-12-10 | SlowFast Networks for Video Recognition | Code | - |
| 43 | SlowFast 8x8 (ResNet-50) | 94.5 | No | 2018-12-10 | SlowFast Networks for Video Recognition | Code | - |
| 44 | SlowFast 4x16 (ResNet-50) | 94 | No | 2018-12-10 | SlowFast Networks for Video Recognition | Code | - |
| 45 | MoViNet-A2 | 93.4 | No | 2021-03-21 | MoViNets: Mobile Video Networks for Efficient Video Recognition | Code | - |
| 46 | MoViNet-A1 | 92.6 | No | 2021-03-21 | MoViNets: Mobile Video Networks for Efficient Video Recognition | Code | - |
| 47 | LGD-3D Flow | 92.4 | No | 2019-06-13 | Learning Spatio-Temporal Representation with Local and Global Diffusion | - | - |
| 48 | MoViNet-A0 | 90.4 | No | 2021-03-21 | MoViNets: Mobile Video Networks for Efficient Video Recognition | Code | - |
| 49 | MoViNet-A3 | 80.8 | No | 2021-03-21 | MoViNets: Mobile Video Networks for Efficient Video Recognition | Code | - |