Video on Kinetics-600

Metric: Top-5 Accuracy (higher is better)
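
As a rough illustration of the metric, top-5 accuracy counts a clip as correct when its true class is among the five highest-scoring predictions, out of the 600 Kinetics-600 classes. The sketch below is a minimal pure-Python version. The function name `top_k_accuracy` and the toy scores are assumptions made for illustration; they are not taken from any listed paper or codebase, and the papers' own evaluation protocols may differ.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 5,
) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this clip.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy example: 4 clips, 6 classes (hypothetical scores, not Kinetics data).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.12, 0.03],
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],
    [0.05, 0.10, 0.15, 0.20, 0.22, 0.28],
    [0.02, 0.08, 0.30, 0.25, 0.20, 0.15],
]
labels = [1, 5, 2, 3]

print(top_k_accuracy(scores, labels, k=5))  # 75.0
print(top_k_accuracy(scores, labels, k=1))  # 25.0
```

With only six classes, a clip misses the top 5 only when its true class scores lowest, which is why the toy top-5 figure is far higher than the top-1 figure.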

Results

#	Model	Top-5 Accuracy	Extra Data	Paper	Date	Code
1	TubeVit-H	98.9	Yes	Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning	2022-12-06	Code
2	UMT-L (ViT-L/16)	98.8	Yes	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023-03-28	Code
3	TubeVit-L	98.7	Yes	Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning	2022-12-06	Code
4	MTV-H (WTS 60M)	98.5	Yes	Multiview Transformers for Video Recognition	2022-01-12	Code
5	UniFormerV2-L	98.5	Yes	-	-	Code
6	VideoMAE V2-g (64x266x266)	98.5	Yes	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	2023-03-29	Code
7	mPLUG-2	98.3	Yes	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023-02-01	Code
8	VideoMAE V2-g	98.2	Yes	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	2023-03-29	Code
9	MaskFeat (no extra data, MViT-L)	98	No	Masked Feature Prediction for Self-Supervised Visual Pre-Training	2021-12-16	Code
10	MViTv2-L (ImageNet-21k pretrain)	97.9	No	MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	2021-12-02	Code
11	Florence (curated FLD-900M pretrain)	97.9	Yes	Florence: A New Foundation Model for Computer Vision	2021-11-22	Code
12	CoVeR (JFT-3B)	97.8	Yes	Co-training Transformer with Videos and Images Improves Action Recognition	2021-12-14	-
13	X-CLIP (ViT-L/14, CLIP)	97.7	Yes	Expanding Language-Image Pretrained Models for General Video Recognition	2022-08-04	Code
14	TubeVit-B	97.3	Yes	Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning	2022-12-06	Code
15	CoVeR (JFT-300M)	97.3	Yes	Co-training Transformer with Videos and Images Improves Action Recognition	2021-12-14	-
16	Swin-L (384x384, ImageNet-21k pretrain)	97.3	Yes	Video Swin Transformer	2021-06-24	Code
17	MViTv2-B (train from scratch)	97.2	No	MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	2021-12-02	Code
18	🍷MerlotReserve-Large (+Audio)	97.1	Yes	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound	2022-01-07	-
19	TokenLearner 16at18 w. Fuser (L/10)	97	Yes	TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?	2021-06-21	Code
20	UniFormer-B (ImageNet-1K)	96.7	Yes	-	-	Code
21	🍷MerlotReserve-Base (+Audio)	96.6	Yes	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound	2022-01-07	-
22	VATT-Large	96.6	Yes	VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text	2021-04-22	Code
23	ViViT-H/16x2 (JFT)	96.5	Yes	ViViT: A Video Vision Transformer	2021-03-29	Code
24	Swin-B (ImageNet-21k pretrain)	96.5	Yes	Video Swin Transformer	2021-06-24	Code
25	MoViNet-A6	96.5	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
26	MoViNet-A5 (AutoAugment)	96.4	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
27	🍷MerlotReserve-Large (no Audio)	96.3	Yes	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound	2022-01-07	-
28	XViT (x16)	96.3	No	Space-time Mixing Attention for Video Transformer	2021-06-10	Code
29	MViT-B-24, 32x3	96.3	No	Multiscale Vision Transformers	2021-04-22	Code
30	MViT-B, 32x3	96.3	No	Multiscale Vision Transformers	2021-04-22	Code
31	LGD-3D Two-stream	96.2	No	Learning Spatio-Temporal Representation with Local and Global Diffusion	2019-06-13	-
32	🍷MerlotReserve-Base (no Audio)	95.8	Yes	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound	2022-01-07	-
33	ViViT-L/16x2 (320x320)	95.7	No	ViViT: A Video Vision Transformer	2021-03-29	Code
34	MoViNet-A5	95.7	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
35	MViT-B, 16x4	95.7	No	Multiscale Vision Transformers	2021-04-22	Code
36	PERF-Net (distilled ResNet50-G)	95.7	No	PERF-Net: Pose Empowered RGB-Flow Net	2020-09-28	-
37	ViViT-L/16x2	95.6	No	ViViT: A Video Vision Transformer	2021-03-29	Code
38	LGD-3D RGB	95.6	No	Learning Spatio-Temporal Representation with Local and Global Diffusion	2019-06-13	-
39	SlowFast 16x8 (ResNet-101 + NL)	95.1	No	SlowFast Networks for Video Recognition	2018-12-10	Code
40	SlowFast 16x8 (ResNet-101)	95.1	No	SlowFast Networks for Video Recognition	2018-12-10	Code
41	MoViNet-A4	94.9	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
42	SlowFast 8x8 (ResNet-101)	94.8	No	SlowFast Networks for Video Recognition	2018-12-10	Code
43	SlowFast 8x8 (ResNet-50)	94.5	No	SlowFast Networks for Video Recognition	2018-12-10	Code
44	SlowFast 4x16 (ResNet-50)	94	No	SlowFast Networks for Video Recognition	2018-12-10	Code
45	MoViNet-A2	93.4	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
46	MoViNet-A1	92.6	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
47	LGD-3D Flow	92.4	No	Learning Spatio-Temporal Representation with Local and Global Diffusion	2019-06-13	-
48	MoViNet-A0	90.4	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
49	MoViNet-A3	80.8	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
