Video on Kinetics-600
Metric: Top-1 Accuracy (higher is better)
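As a minimal sketch of the metric named above: top-1 accuracy is the share of evaluation clips whose highest-scoring predicted class equals the ground-truth label, reported here as a percentage. The function name `top1_accuracy` and the toy scores are illustrative assumptions, not part of the leaderboard or of any listed model's evaluation code, and the sketch ignores multi-clip or multi-crop test-time averaging that individual papers may use.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Percentage of samples whose highest-scoring class matches the label.

    `scores[i][c]` is the (hypothetical) model score for class `c` on sample `i`.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the top-1 prediction.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example with 4 clips and 3 classes (made-up numbers).
scores = [
    [0.1, 0.7, 0.2],  # predicts class 1
    [0.6, 0.3, 0.1],  # predicts class 0
    [0.2, 0.2, 0.6],  # predicts class 2
    [0.5, 0.4, 0.1],  # predicts class 0
]
labels = [1, 0, 2, 1]
print(f"{top1_accuracy(scores, labels):.1f}")  # 3 of 4 correct, prints 75.0
```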
Results
| # | Model | Top-1 Accuracy (%) | Extra Data | Paper | Date | Code | SOTA |
|---|---|---|---|---|---|---|---|
| 1 | InternVideo2-6B | 91.9 | Yes | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 2024-03-22 | Code | SOTA |
| 2 | TubeVit-H | 91.8 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | 2022-12-06 | Code | SOTA |
| 3 | InternVideo2-1B | 91.6 | Yes | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 2024-03-22 | Code | |
| 4 | TubeVit-L | 91.5 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | 2022-12-06 | Code | |
| 5 | InternVideo-T | 91.3 | Yes | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 2022-12-06 | Code | |
| 6 | 🍷MerlotReserve-Large (+Audio) | 91.1 | Yes | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | 2022-01-07 | - | SOTA |
| 7 | TubeVit-B | 90.9 | Yes | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | 2022-12-06 | Code | |
| 8 | UMT-L (ViT-L/16) | 90.5 | Yes | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | 2023-03-28 | Code | |
| 9 | MTV-H (WTS 60M) | 90.3 | Yes | Multiview Transformers for Video Recognition | 2022-01-12 | Code | |
| 10 | UniFormerV2-L | 90.1 | Yes | - | - | Code | |
| 11 | VideoMAE V2-g (64x266x266) | 89.9 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | 2023-03-29 | Code | |
| 12 | mPLUG-2 | 89.8 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | 2023-02-01 | Code | |
| 13 | 🍷MerlotReserve-Base (+Audio) | 89.7 | Yes | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | 2022-01-07 | - | |
| 14 | 🍷MerlotReserve-Large (no Audio) | 89.4 | Yes | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | 2022-01-07 | - | |
| 15 | CoCa (finetuned) | 89.4 | Yes | CoCa: Contrastive Captioners are Image-Text Foundation Models | 2022-05-04 | Code | |
| 16 | VideoMAE V2-g | 88.8 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | 2023-03-29 | Code | |
| 17 | Hiera-H (no extra data) | 88.8 | No | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | 2023-06-01 | Code | |
| 18 | CoCa (frozen) | 88.5 | Yes | CoCa: Contrastive Captioners are Image-Text Foundation Models | 2022-05-04 | Code | |
| 19 | MaskFeat (no extra data, MViT-L) | 88.3 | No | Masked Feature Prediction for Self-Supervised Visual Pre-Training | 2021-12-16 | Code | SOTA |
| 20 | X-CLIP(ViT-L/14, CLIP) | 88.3 | Yes | Expanding Language-Image Pretrained Models for General Video Recognition | 2022-08-04 | Code | |
| 21 | 🍷MerlotReserve-Base (no Audio) | 88.1 | Yes | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | 2022-01-07 | - | |
| 22 | MViTv2-L (ImageNet-21k pretrain) | 87.9 | No | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | 2021-12-02 | Code | SOTA |
| 23 | CoVeR (JFT-3B) | 87.9 | Yes | Co-training Transformer with Videos and Images Improves Action Recognition | 2021-12-14 | - | |
| 24 | Florence (curated FLD-900M pretrain) | 87.8 | Yes | Florence: A New Foundation Model for Computer Vision | 2021-11-22 | Code | SOTA |
| 25 | CoVeR (JFT-300M) | 86.8 | Yes | Co-training Transformer with Videos and Images Improves Action Recognition | 2021-12-14 | - | |
| 26 | TokenLearner 16at18 w. Fuser (L/10) | 86.3 | Yes | TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | 2021-06-21 | Code | SOTA |
| 27 | Swin-L (384x384, ImageNet-21k pretrain) | 86.1 | Yes | Video Swin Transformer | 2021-06-24 | Code | |
| 28 | ViViT-H/16x2 (JFT) | 85.8 | Yes | ViViT: A Video Vision Transformer | 2021-03-29 | Code | SOTA |
| 29 | MViTv2-L (train from scratch) | 85.5 | No | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | 2021-12-02 | Code | |
| 30 | UniFormer-B (ImageNet-1K) | 84.8 | Yes | - | - | Code | |
| 31 | XViT (x16) | 84.5 | No | Space-time Mixing Attention for Video Transformer | 2021-06-10 | Code | |
| 32 | MoViNet-A5 (AutoAugment) | 84.3 | No | MoViNets: Mobile Video Networks for Efficient Video Recognition | 2021-03-21 | Code | SOTA |
| 33 | ViViT-L/16x2 | 84.3 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code | |
| 34 | Swin-B (ImageNet-21k pretrain) | 84.0 | Yes | Video Swin Transformer | 2021-06-24 | Code | |
| 35 | MViT-B-24, 32x3 | 83.8 | No | Multiscale Vision Transformers | 2021-04-22 | Code | |
| 36 | VATT-Large | 83.6 | Yes | VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | 2021-04-22 | Code | |
| 37 | MoViNet-A6 | 83.5 | No | MoViNets: Mobile Video Networks for Efficient Video Recognition | 2021-03-21 | Code | |
| 38 | MViT-B, 32x3 | 83.4 | No | Multiscale Vision Transformers | 2021-04-22 | Code | |
| 39 | LGD-3D Two-stream | 83.1 | No | Learning Spatio-Temporal Representation with Local and Global Diffusion | 2019-06-13 | - | SOTA |
| 40 | R3D-RS-200 | 83.1 | No | Revisiting 3D ResNets for Video Recognition | 2021-09-03 | Code | |
| 41 | ViViT-L/16x2 (320x320) | 83.0 | No | ViViT: A Video Vision Transformer | 2021-03-29 | Code | |
| 42 | MoViNet-A5 | 82.7 | No | MoViNets: Mobile Video Networks for Efficient Video Recognition | 2021-03-21 | Code | |
| 43 | MViT-B, 16x4 | 82.1 | No | Multiscale Vision Transformers | 2021-04-22 | Code | |
| 44 | PERF-Net (distilled ResNet50-G) | 82.0 | No | PERF-Net: Pose Empowered RGB-Flow Net | 2020-09-28 | - | |
| 45 | SlowFast 16x8 (ResNet-101 + NL) | 81.8 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code | SOTA |
| 46 | LGD-3D RGB | 81.5 | No | Learning Spatio-Temporal Representation with Local and Global Diffusion | 2019-06-13 | - | |
| 47 | MoViNet-A4 | 81.2 | No | MoViNets: Mobile Video Networks for Efficient Video Recognition | 2021-03-21 | Code | |
| 48 | SlowFast 16x8 (ResNet-101) | 81.1 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code | |
| 49 | MoViNet-A3 | 80.8 | No | MoViNets: Mobile Video Networks for Efficient Video Recognition | 2021-03-21 | Code | |
| 50 | SlowFast 8x8 (ResNet-101) | 80.4 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code | |
| 51 | SlowFast 8x8 (ResNet-50) | 79.9 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code | |
| 52 | D3D+S3D-G | 79.1 | No | D3D: Distilled 3D Networks for Video Action Recognition | 2018-12-19 | Code | |
| 53 | SlowFast 4x16 (ResNet-50) | 78.8 | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code | |
| 54 | S3D-G (RGB+Flow) | 78.6 | No | Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | 2017-12-13 | Code | SOTA |
| 55 | D3D | 77.9 | No | D3D: Distilled 3D Networks for Video Action Recognition | 2018-12-19 | Code | |
| 56 | MoViNet-A2 | 77.5 | No | MoViNets: Mobile Video Networks for Efficient Video Recognition | 2021-03-21 | Code | |
| 57 | S3D-G (RGB) | 76.6 | No | Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | 2017-12-13 | Code | |
| 58 | MoViNet-A1 | 76.0 | No | MoViNets: Mobile Video Networks for Efficient Video Recognition | 2021-03-21 | Code | |
| 59 | LGD-3D Flow | 75.0 | No | Learning Spatio-Temporal Representation with Local and Global Diffusion | 2019-06-13 | - | |
| 60 | I3D (RGB) | 73.6 | No | A Short Note about Kinetics-600 | 2018-08-03 | Code | |
| 61 | MoViNet-A0 | 71.5 | No | MoViNets: Mobile Video Networks for Efficient Video Recognition | 2021-03-21 | Code | |
| 62 | S3D-G (Flow) | 69.7 | No | Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | 2017-12-13 | Code | |
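Because each result carries an Extra Data flag, one way to read the leaderboard is to compare the best entry within each group rather than only the overall rank. The sketch below assumes a small hand-copied subset of the rows listed here; the `Entry` class and `best_entry` helper are hypothetical names for illustration, not an API offered by the leaderboard.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Entry:
    model: str
    top1: float        # Top-1 accuracy as listed on the leaderboard
    extra_data: bool   # the leaderboard's "Extra Data" flag


# A hand-copied subset of the leaderboard rows (not the full list).
ENTRIES = [
    Entry("InternVideo2-6B", 91.9, True),
    Entry("TubeVit-H", 91.8, True),
    Entry("Hiera-H (no extra data)", 88.8, False),
    Entry("MaskFeat (no extra data, MViT-L)", 88.3, False),
    Entry("MViTv2-L (ImageNet-21k pretrain)", 87.9, False),
]


def best_entry(entries: Iterable[Entry], extra_data: bool) -> Entry:
    """Return the highest-accuracy entry whose Extra Data flag matches."""
    candidates = [e for e in entries if e.extra_data == extra_data]
    if not candidates:
        raise ValueError("no entries with the requested Extra Data flag")
    return max(candidates, key=lambda e: e.top1)


winner = best_entry(ENTRIES, extra_data=False)
print(f"{winner.model} {winner.top1}")  # prints: Hiera-H (no extra data) 88.8
```

The flag is taken exactly as the leaderboard reports it, so an entry such as MViTv2-L (ImageNet-21k pretrain), which is listed with Extra Data "No", falls in the no-extra-data group here as well.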