Video on MiT (Moments in Time)

Metric: Top 5 Accuracy (higher is better)
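
As a rough illustration of the metric, the sketch below counts a prediction as correct when the true class is among the five highest-scoring classes, and reports the result as a percentage. This assumes the leaderboard values are percentages. The function name `top_k_accuracy` and the toy scores are hypothetical and are not taken from any listed paper or codebase.

```python
def top_k_accuracy(scores, labels, k=5):
    """Percentage of samples whose true label is among the k highest-scoring classes.

    scores: one list of per-class scores per sample (illustrative layout).
    labels: one integer class index per sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy data: 4 samples, 6 classes (made up for illustration only).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.12, 0.03],
    [0.30, 0.11, 0.04, 0.25, 0.20, 0.10],
    [0.05, 0.15, 0.40, 0.22, 0.10, 0.08],
    [0.35, 0.02, 0.18, 0.15, 0.20, 0.10],
]
labels = [1, 2, 4, 0]

top5 = top_k_accuracy(scores, labels, k=5)
top1 = top_k_accuracy(scores, labels, k=1)
print(top5, top1)
```

With only six classes in the toy data, top-5 is far more lenient than it is on a benchmark with hundreds of classes, so the example shows the mechanics only.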

Results

#	Model	Top 5 Accuracy	Extra Data	SOTA	Paper	Date	Code
1	UMT-L (ViT-L/16)	78.2	Yes	Yes	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023-03-28	Code
2	UniFormerV2-L	76.9	Yes	No	-	-	Code
3	MTV-H (WTS 60M)	75.7	Yes	Yes	Multiview Transformers for Video Recognition	2022-01-12	Code
4	CoVeR(JFT-3B)	75.4	Yes	Yes	Co-training Transformer with Videos and Images Improves Action Recognition	2021-12-14	-
5	CoVeR(JFT-300M)	73.9	Yes	No	Co-training Transformer with Videos and Images Improves Action Recognition	2021-12-14	-
6	VATT-Large	67.7	Yes	Yes	VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text	2021-04-22	Code
7	VTN	65.4	Yes	Yes	Video Transformer Network	2021-02-01	Code
8	ViViT-L/16x2	64.9	Yes	No	ViViT: A Video Vision Transformer	2021-03-29	Code
9	MBT (AV)	61.2	No	No	Attention Bottlenecks for Multimodal Fusion	2021-06-30	Code
10	SRTG r3d-101	58.49	No	Yes	Learn to cycle: Time-consistent feature discovery for action recognition	2020-06-15	Code
11	SRTG r(2+1)d-50	56.8	No	No	Learn to cycle: Time-consistent feature discovery for action recognition	2020-06-15	Code
12	SRTG r3d-50	55.65	No	No	Learn to cycle: Time-consistent feature discovery for action recognition	2020-06-15	Code
13	SRTG r(2+1)d-34	54.18	No	No	Learn to cycle: Time-consistent feature discovery for action recognition	2020-06-15	Code
14	TRN-Multiscale	53.87	No	Yes	Moments in Time Dataset: one million videos for event understanding	2018-01-09	Code
15	SRTG r3d-34	52.35	No	No	Learn to cycle: Time-consistent feature discovery for action recognition	2020-06-15	Code