Action Recognition on Something-Something V1

Metric: Top 1 Accuracy (higher is better)
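
For readers unfamiliar with the metric, the following is a minimal sketch of how top-1 accuracy is commonly computed: the share of clips whose highest-scoring class matches the ground-truth label, expressed as a percentage. The function name `top1_accuracy` and the toy scores are illustrative assumptions, not taken from any listed paper or from the benchmark's evaluation code.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of samples whose arg-max class equals the label.

    `scores` holds one row of per-class scores per sample (hypothetical data).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score is the model's single (top-1) prediction.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: 4 clips, 3 classes. Predictions are 1, 0, 2, 0.
scores = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
labels = [1, 0, 1, 0]
print(top1_accuracy(scores, labels))  # 75.0
```

Individual entries may differ in how many clips and crops they average before taking the arg-max, so figures in the table are not always obtained under identical test protocols.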

Results

#	Model	Top 1 Accuracy	Extra Data	Paper	Date	Code	SOTA
1	InternVideo	70	Yes	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022-12-06	Code	Yes
2	VideoMAE V2-g	68.7	Yes	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	2023-03-29	Code	No
3	Side4Video (EVA ViT-E/14)	67.3	No	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	2023-11-27	Code	No
4	ATM	65.6	No	What Can Simple Arithmetic Operations Do for Temporal Modeling?	2023-07-18	Code	No
5	TAdaFormer-L/14	63.7	Yes	Temporally-Adaptive Models for Efficient Video Understanding	2023-08-10	Code	No
6	TDS-CLIP-ViT-L/14(8frames)	63	No	TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning	2024-08-20	Code	No
7	UniFormerV2-L	62.7	Yes	-	-	Code	No
8	StructVit-B-4-1	61.3	No	Learning Correlation Structures for Vision Transformers	2024-04-05	-	No
9	UniFormer-B (IN-1K + Kinetics400)	60.9	No	-	-	Code	No
10	TAdaConvNeXtV2-B	60.7	Yes	Temporally-Adaptive Models for Efficient Video Understanding	2023-08-10	Code	No
11	TPS	58.3	No	Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition	2022-07-27	Code	Yes
12	MSMA (8+16frames)	57.9	No	-	-	-	No
13	UniFormer-B (IN-1K + Kinetics600)	57.6	No	-	-	Code	No
14	SIFA	57.3	No	Stand-Alone Inter-Frame Attention in Video Models	2022-06-14	Code	Yes
15	EAN ResNet50 (single clip, center crop,8+16 ensemble, with sparse Transformer)	57.2	No	EAN: Event Adaptive Network for Enhanced Action Recognition	2021-07-22	Code	Yes
16	TCM (Ensemble)	57.2	No	Motion-driven Visual Tempo Learning for Video-based Action Recognition	2022-02-24	Code	No
17	BQNEn (ImageNet + K400 pretrained)	57.1	No	Busy-Quiet Video Disentangling for Video Classification	2021-03-29	Code	Yes
18	TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)	56.8	No	TDN: Temporal Difference Networks for Efficient Action Recognition	2020-12-18	Code	Yes
19	SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips)	56.6	Yes	Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	2021-02-14	Code	No
20	CT-Net Ensemble (R50, 8+12+16+24)	56.6	No	CT-Net: Channel Tensorization Network for Video Classification	2021-06-03	Code	No
21	MoDS (8+16frames)	56.6	No	-	-	-	No
22	MLP-3D	56.5	No	MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing	2022-06-13	-	No
23	RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips)	56.1	No	Relational Self-Attention: What's Missing in Attention for Video Understanding	2021-11-02	Code	No
24	SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip)	55.8	Yes	Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	2021-02-14	Code	No
25	RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip)	55.5	No	Relational Self-Attention: What's Missing in Attention for Video Understanding	2021-11-02	Code	No
26	PAN ResNet101 (RGB only, no Flow)	55.3	No	PAN: Towards Fast Action Recognition via Learning Persistence of Appearance	2020-08-08	Code	Yes
27	GSM Ensemble InceptionV3 (ImageNet pretrained)	55.16	Yes	Gate-Shift Networks for Video Action Recognition	2019-12-01	Code	Yes
28	MSNet-R50En (ensemble)	55.1	Yes	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	2020-07-20	Code	No
29	AE-Net (8+16frames)	55	No	-	-	-	No
30	VoV3D-L (32frames, Kinetics pretrained, single)	54.59	Yes	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code	No
31	MSNet-R50En (8+16 ensemble, ImageNet pretrained)	54.4	Yes	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	2020-07-20	Code	No
32	SELFYNet-TSM-R50 (16 frames, ImageNet pretrained)	54.3	Yes	Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	2021-02-14	Code	No
33	RNL+TSM Ensemble(R50+R101, ImageNet pretrained)	54.1	No	Region-based Non-local Operation for Video Classification	2020-07-17	Code	No
34	RSANet-R50 (16 frames, ImageNet pretrained, a single clip)	54	No	Relational Self-Attention: What's Missing in Attention for Video Understanding	2021-11-02	Code	No
35	MVFNet-R50EN	54	No	MVFNet: Multi-View Fusion Network for Efficient Video Recognition	2020-12-13	Code	No
36	STPG (8+16frames)	53.5	No	-	-	-	No
37	GB + DF + LB (ResNet152, ImageNet pretrained)	53.4	Yes	Action recognition with spatial-temporal discriminative filter banks	2019-08-20	-	Yes
38	ip-CSN-152 (IG-65M pretraining)	53.3	No	Video Classification with Channel-Separated Convolutional Networks	2019-04-04	Code	Yes
39	MARS+RGB+Flow (64 frames, Kinetics pretrained)	53	Yes	-	-	Code	No
40	RNL+TSM Ensemble(ResNet50, ImageNet pretrained)	52.7	No	Region-based Non-local Operation for Video Classification	2020-07-17	Code	No
41	VoV3D-M (32frames, Kinetics pretrained, single)	52.68	Yes	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code	No
42	TSM+W3 (16 frames, ResNet50)	52.6	No	Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention	2020-04-02	-	No
43	AK-Net	52.5	No	Action Keypoint Network for Efficient Video Recognition	2022-01-17	-	No
44	MSNet-R50 (16 frames, ImageNet pretrained)	52.1	Yes	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	2020-07-20	Code	No
45	ir-CSN-152 (IG-65M pretraining)	52.1	No	Video Classification with Channel-Separated Convolutional Networks	2019-04-04	Code	No
46	RSANet-R50 (8 frames, ImageNet pretrained, a single clip)	51.9	No	Relational Self-Attention: What's Missing in Attention for Video Understanding	2021-11-02	Code	No
47	GSM InceptionV3 (16 frames, ImageNet pretrained)	51.68	Yes	Gate-Shift Networks for Video Action Recognition	2019-12-01	Code	No
48	R(2+1)D-152 (IG-65M pretraining)	51.6	No	Video Classification with Channel-Separated Convolutional Networks	2019-04-04	Code	No
49	MSNet-R50 (8 frames, ImageNet pretrained)	50.9	No	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	2020-07-20	Code	No
50	TSM (RGB + Flow)	50.7	No	TSM: Temporal Shift Module for Efficient Video Understanding	2018-11-20	Code	Yes
51	STM (16 frames, ImageNet pretraining)	50.7	No	STM: SpatioTemporal and Motion Encoding for Action Recognition	2019-08-07	-	No
52	VoV3D-L (32frames, from scratch, single)	50.6	No	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code	No
53	ResNet50 I3D (Moments pretrained)	50	Yes	Moments in Time Dataset: one million videos for event understanding	2018-01-09	Code	Yes
54	VoV3D-M (32frames, from scratch, single)	49.8	No	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code	No
55	TSMEn	49.7	No	TSM: Temporal Shift Module for Efficient Video Understanding	2018-11-20	Code	No
56	TRG (Inception-V3)	49.7	No	Temporal Reasoning Graph for Activity Recognition	2019-08-27	-	No
57	TRG (ResNet-50)	49.5	No	Temporal Reasoning Graph for Activity Recognition	2019-08-27	-	No
58	VoV3D-L (16frames, from scratch, single)	49.5	No	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code	No
59	ir-CSN-152	49.3	No	Video Classification with Channel-Separated Convolutional Networks	2019-04-04	Code	No
60	RSTG (Kinetics pretrained)	49.2	Yes	Recurrent Space-time Graph Neural Networks	2019-04-11	Code	No
61	ResNet50 I3D (Kinetics pretrained)	48.6	Yes	Moments in Time Dataset: one million videos for event understanding	2018-01-09	Code	No
62	ir-CSN-101	48.4	No	Video Classification with Channel-Separated Convolutional Networks	2019-04-04	Code	No
63	S3D-G (ImageNet pretrained)	48.2	Yes	Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	2017-12-13	Code	Yes
64	VoV3D-M (16frames, from scratch, single)	48.1	No	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code	No
65	S3D	47.3	No	Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	2017-12-13	Code	No
66	TSM	47.2	No	TSM: Temporal Shift Module for Efficient Video Understanding	2018-11-20	Code	No
67	ECO-Net (ImageNet pretrained)	46.4	Yes	ECO: Efficient Convolutional Network for Online Video Understanding	2018-04-24	Code	No
68	ECO-Net	46.4	No	ECO: Efficient Convolutional Network for Online Video Understanding	2018-04-24	Code	No
69	NL I3D + GCN	46.1	No	Videos as Space-Time Region Graphs	2018-06-05	-	No
70	NL I3D	44.4	No	Non-local Neural Networks	2017-11-21	Code	Yes
71	Motion Feature Net	43.9	No	Motion Feature Network: Fixed Motion Filter for Action Recognition	2018-07-26	-	No
72	Motion Feature Net	43.9	No	Motion Feature Network: Fixed Motion Filter for Action Recognition	2018-07-26	-	No
73	2-Stream TRN	42.01	No	Temporal Relational Reasoning in Videos	2017-11-22	Code	No
74	2-Stream TRN	42.01	No	Temporal Relational Reasoning in Videos	2017-11-22	Code	No
75	HF-TSN (ImageNet pretraining)	41.97	Yes	Hierarchical Feature Aggregation Networks for Video Action Recognition	2019-05-29	-	No
76	MARS+RGB+Flow (16 frames, Kinetics pretrained)	40.4	No	-	-	Code	No
77	M-TRN	34.4	No	Temporal Relational Reasoning in Videos	2017-11-22	Code	No
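
The SOTA markers in the results appear consistent with a running best over time: an entry carries the marker when its accuracy strictly exceeds every earlier-dated entry, and only the best entry on a given date is marked. The page does not state this rule, so the sketch below is an assumption that happens to reproduce the markers. The function name `running_best` is hypothetical, the rows are a small subset copied from the table, and undated entries are left out.

```python
from datetime import date


def running_best(entries):
    """Return names of entries that strictly beat all earlier-dated results.

    `entries` is a list of (model, top1_accuracy, iso_date) tuples.
    """
    best = float("-inf")
    winners = []
    # Sort by date; within one date, put the highest accuracy first so that
    # only the best same-day entry can set a new record.
    ordered = sorted(entries, key=lambda e: (date.fromisoformat(e[2]), -e[1]))
    for model, accuracy, _ in ordered:
        if accuracy > best:
            best = accuracy
            winners.append(model)
    return winners


# Subset of dated rows from the leaderboard above.
rows = [
    ("NL I3D", 44.4, "2017-11-21"),
    ("M-TRN", 34.4, "2017-11-22"),
    ("S3D-G (ImageNet pretrained)", 48.2, "2017-12-13"),
    ("S3D", 47.3, "2017-12-13"),
    ("ResNet50 I3D (Moments pretrained)", 50.0, "2018-01-09"),
    ("ECO-Net", 46.4, "2018-04-24"),
    ("TSM (RGB + Flow)", 50.7, "2018-11-20"),
]
print(running_best(rows))
```

On this subset the function returns NL I3D, S3D-G, ResNet50 I3D (Moments pretrained), and TSM (RGB + Flow), which are the four of these rows that carry the SOTA marker.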