Video on Kinetics-600

Metric: Top-1 Accuracy (higher is better)
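
As a rough illustration of the metric, top-1 accuracy is the percentage of test clips whose highest-scoring predicted class matches the ground-truth label. The sketch below is a minimal pure-Python version under that assumption; the function name `top1_accuracy`, the toy scores, and the three-class setup are hypothetical and not taken from any listed paper's evaluation code. Individual entries may also differ in evaluation protocol (number of clips or crops averaged per video), which this sketch does not model.

```python
def top1_accuracy(scores, labels):
    """Return the percentage of examples whose arg-max score equals the label.

    scores: list of per-class score lists, one per clip (hypothetical input format).
    labels: list of integer ground-truth class indices.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score is the top-1 prediction.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: 4 clips, 3 hypothetical classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
labels = [1, 0, 2, 2]
print(top1_accuracy(scores, labels))
```

The leaderboard values below are on the same 0 to 100 scale.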

Results

#	Model	Top-1 Accuracy	Extra Data	Paper	Date	Code	SOTA
1	InternVideo2-6B	91.9	Yes	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024-03-22	Code	Yes
2	TubeVit-H	91.8	Yes	Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning	2022-12-06	Code	Yes
3	InternVideo2-1B	91.6	Yes	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024-03-22	Code	-
4	TubeVit-L	91.5	Yes	Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning	2022-12-06	Code	-
5	InternVideo-T	91.3	Yes	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022-12-06	Code	-
6	🍷MerlotReserve-Large (+Audio)	91.1	Yes	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound	2022-01-07	-	Yes
7	TubeVit-B	90.9	Yes	Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning	2022-12-06	Code	-
8	UMT-L (ViT-L/16)	90.5	Yes	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023-03-28	Code	-
9	MTV-H (WTS 60M)	90.3	Yes	Multiview Transformers for Video Recognition	2022-01-12	Code	-
10	UniFormerV2-L	90.1	Yes	-	-	Code	-
11	VideoMAE V2-g (64x266x266)	89.9	Yes	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	2023-03-29	Code	-
12	mPLUG-2	89.8	Yes	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023-02-01	Code	-
13	🍷MerlotReserve-Base (+Audio)	89.7	Yes	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound	2022-01-07	-	-
14	🍷MerlotReserve-Large (no Audio)	89.4	Yes	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound	2022-01-07	-	-
15	CoCa (finetuned)	89.4	Yes	CoCa: Contrastive Captioners are Image-Text Foundation Models	2022-05-04	Code	-
16	VideoMAE V2-g	88.8	Yes	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	2023-03-29	Code	-
17	Hiera-H (no extra data)	88.8	No	Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles	2023-06-01	Code	-
18	CoCa (frozen)	88.5	Yes	CoCa: Contrastive Captioners are Image-Text Foundation Models	2022-05-04	Code	-
19	MaskFeat (no extra data, MViT-L)	88.3	No	Masked Feature Prediction for Self-Supervised Visual Pre-Training	2021-12-16	Code	Yes
20	X-CLIP(ViT-L/14, CLIP)	88.3	Yes	Expanding Language-Image Pretrained Models for General Video Recognition	2022-08-04	Code	-
21	🍷MerlotReserve-Base (no Audio)	88.1	Yes	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound	2022-01-07	-	-
22	MViTv2-L (ImageNet-21k pretrain)	87.9	No	MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	2021-12-02	Code	Yes
23	CoVeR (JFT-3B)	87.9	Yes	Co-training Transformer with Videos and Images Improves Action Recognition	2021-12-14	-	-
24	Florence (curated FLD-900M pretrain)	87.8	Yes	Florence: A New Foundation Model for Computer Vision	2021-11-22	Code	Yes
25	CoVeR (JFT-300M)	86.8	Yes	Co-training Transformer with Videos and Images Improves Action Recognition	2021-12-14	-	-
26	TokenLearner 16at18 w. Fuser (L/10)	86.3	Yes	TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?	2021-06-21	Code	Yes
27	Swin-L (384x384, ImageNet-21k pretrain)	86.1	Yes	Video Swin Transformer	2021-06-24	Code	-
28	ViViT-H/16x2 (JFT)	85.8	Yes	ViViT: A Video Vision Transformer	2021-03-29	Code	Yes
29	MViTv2-L (train from scratch)	85.5	No	MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	2021-12-02	Code	-
30	UniFormer-B (ImageNet-1K)	84.8	Yes	-	-	Code	-
31	XViT (x16)	84.5	No	Space-time Mixing Attention for Video Transformer	2021-06-10	Code	-
32	MoViNet-A5 (AutoAugment)	84.3	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code	Yes
33	ViViT-L/16x2	84.3	No	ViViT: A Video Vision Transformer	2021-03-29	Code	-
34	Swin-B (ImageNet-21k pretrain)	84	Yes	Video Swin Transformer	2021-06-24	Code	-
35	MViT-B-24, 32x3	83.8	No	Multiscale Vision Transformers	2021-04-22	Code	-
36	VATT-Large	83.6	Yes	VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text	2021-04-22	Code	-
37	MoViNet-A6	83.5	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code	-
38	MViT-B, 32x3	83.4	No	Multiscale Vision Transformers	2021-04-22	Code	-
39	LGD-3D Two-stream	83.1	No	Learning Spatio-Temporal Representation with Local and Global Diffusion	2019-06-13	-	Yes
40	R3D-RS-200	83.1	No	Revisiting 3D ResNets for Video Recognition	2021-09-03	Code	-
41	ViViT-L/16x2 (320x320)	83	No	ViViT: A Video Vision Transformer	2021-03-29	Code	-
42	MoViNet-A5	82.7	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code	-
43	MViT-B, 16x4	82.1	No	Multiscale Vision Transformers	2021-04-22	Code	-
44	PERF-Net (distilled ResNet50-G)	82	No	PERF-Net: Pose Empowered RGB-Flow Net	2020-09-28	-	-
45	SlowFast 16x8 (ResNet-101 + NL)	81.8	No	SlowFast Networks for Video Recognition	2018-12-10	Code	Yes
46	LGD-3D RGB	81.5	No	Learning Spatio-Temporal Representation with Local and Global Diffusion	2019-06-13	-	-
47	MoViNet-A4	81.2	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code	-
48	SlowFast 16x8 (ResNet-101)	81.1	No	SlowFast Networks for Video Recognition	2018-12-10	Code	-
49	MoViNet-A3	80.8	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code	-
50	SlowFast 8x8 (ResNet-101)	80.4	No	SlowFast Networks for Video Recognition	2018-12-10	Code	-
51	SlowFast 8x8 (ResNet-50)	79.9	No	SlowFast Networks for Video Recognition	2018-12-10	Code	-
52	D3D+S3D-G	79.1	No	D3D: Distilled 3D Networks for Video Action Recognition	2018-12-19	Code	-
53	SlowFast 4x16 (ResNet-50)	78.8	No	SlowFast Networks for Video Recognition	2018-12-10	Code	-
54	S3D-G (RGB+Flow)	78.6	No	Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	2017-12-13	Code	Yes
55	D3D	77.9	No	D3D: Distilled 3D Networks for Video Action Recognition	2018-12-19	Code	-
56	MoViNet-A2	77.5	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code	-
57	S3D-G (RGB)	76.6	No	Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	2017-12-13	Code	-
58	MoViNet-A1	76	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code	-
59	LGD-3D Flow	75	No	Learning Spatio-Temporal Representation with Local and Global Diffusion	2019-06-13	-	-
60	I3D (RGB)	73.6	No	A Short Note about Kinetics-600	2018-08-03	Code	-
61	MoViNet-A0	71.5	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code	-
62	S3D-G (Flow)	69.7	No	Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	2017-12-13	Code	-