Action Recognition on Something-Something V1

Metric: Top 5 Accuracy (higher is better)
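
As a rough illustration of the metric, top-5 accuracy counts a clip as correct when its true label is among the five highest-scoring classes, and reports the share of such clips as a percentage. The sketch below is a minimal pure-Python version. The function name `top_k_accuracy` and the toy scores are assumptions made for illustration; this is not the evaluation code behind any entry on this leaderboard.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 5) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; ties keep the lower index first.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy example (made-up scores): 4 clips, 6 classes.
scores = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # true class 0 has the lowest score: miss
    [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],  # true class 0 ranks first: hit
    [0.9, 0.0, 0.1, 0.2, 0.3, 0.4],  # true class 1 has the lowest score: miss
    [0.3, 0.1, 0.6, 0.2, 0.5, 0.4],  # true class 3 ranks fifth: hit
]
labels = [0, 0, 1, 3]

print(top_k_accuracy(scores, labels, k=5))  # 50.0
print(top_k_accuracy(scores, labels, k=1))  # 25.0
```

With k=1 the same function gives top-1 accuracy, which is why top-5 figures are always at least as high as top-1 figures for the same predictions.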

Results

#	Model	Top 5 Accuracy	Extra Data	Paper	Date	Code
1	VideoMAE V2-g	91.9	Yes	VideoMAE V2: Scaling Video Masked Autoencoders w...	2023-03-29	Code
2	Side4Video (EVA ViT-E/14)	88.8	No	Side4Video: Spatial-Temporal Side Network for Me...	2023-11-27	Code
3	ATM	88.6	No	What Can Simple Arithmetic Operations Do for Tem...	2023-07-18	Code
4	UniFormerV2-L	88	Yes	-	-	Code
5	TDS-CLIP-ViT-L/14(8frames)	87.8	No	TDS-CLIP: Temporal Difference Side Network for I...	2024-08-20	Code
6	UniFormer-B (IN-1K + Kinetics400)	87.3	No	-	-	Code
7	TRG (ResNet-50)	86.1	No	Temporal Reasoning Graph for Activity Recognition	2019-08-27	-
8	UniFormer-B (IN-1K + Kinetics600)	84.9	No	-	-	Code
9	SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips)	84.4	Yes	Learning Self-Similarity in Space and Time as Ge...	2021-02-14	Code
10	BQNEn (ImageNet + K400 pretrained)	84.2	No	Busy-Quiet Video Disentangling for Video Classif...	2021-03-29	Code
11	TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)	84.1	No	TDN: Temporal Difference Networks for Efficient ...	2020-12-18	Code
12	EAN ResNet50 (single clip, center crop,8+16 ensemble, with sparse Transformer)	83.9	No	EAN: Event Adaptive Network for Enhanced Action ...	2021-07-22	Code
13	SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip)	83.9	Yes	Learning Self-Similarity in Space and Time as Ge...	2021-02-14	Code
14	MSNet-R50En (8+16 ensemble, ImageNet pretrained)	83.8	Yes	MotionSqueeze: Neural Motion Feature Learning fo...	2020-07-20	Code
15	SELFYNet-TSM-R50 (16 frames, ImageNet pretrained)	82.9	Yes	Learning Self-Similarity in Space and Time as Ge...	2021-02-14	Code
16	RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips)	82.8	No	Relational Self-Attention: What's Missing in Att...	2021-11-02	Code
17	PAN ResNet101 (RGB only, no Flow)	82.8	No	PAN: Towards Fast Action Recognition via Learnin...	2020-08-08	Code
18	RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip)	82.6	No	Relational Self-Attention: What's Missing in Att...	2021-11-02	Code
19	VoV3D-L (32frames, Kinetics pretrained, single)	82.3	Yes	Diverse Temporal Aggregation and Depthwise Spati...	2020-12-01	Code
20	MSNet-R50 (16 frames, ImageNet pretrained)	82.3	Yes	MotionSqueeze: Neural Motion Feature Learning fo...	2020-07-20	Code
21	RNL+TSM Ensemble(R50+R101, ImageNet pretrained)	82.2	No	Region-based Non-local Operation for Video Class...	2020-07-17	Code
22	RNL+TSM Ensemble(ResNet50, ImageNet pretrained)	81.5	No	Region-based Non-local Operation for Video Class...	2020-07-17	Code
23	TSM+W3 (16 frames, ResNet50)	81.3	No	Knowing What, Where and When to Look: Efficient ...	2020-04-02	-
24	RSANet-R50 (16 frames, ImageNet pretrained, a single clip)	81.1	No	Relational Self-Attention: What's Missing in Att...	2021-11-02	Code
25	VoV3D-M (32frames, Kinetics pretrained, single)	80.43	Yes	Diverse Temporal Aggregation and Depthwise Spati...	2020-12-01	Code
26	MSNet-R50 (8 frames, ImageNet pretrained)	80.3	No	MotionSqueeze: Neural Motion Feature Learning fo...	2020-07-20	Code
27	RSANet-R50 (8 frames, ImageNet pretrained, a single clip)	79.6	No	Relational Self-Attention: What's Missing in Att...	2021-11-02	Code
28	VoV3D-L (32frames, from scratch, single)	78.7	No	Diverse Temporal Aggregation and Depthwise Spati...	2020-12-01	Code
29	S3D-G (ImageNet pretrained)	78.7	Yes	Rethinking Spatiotemporal Feature Learning: Spee...	2017-12-13	Code
30	TSMEn	78.5	No	TSM: Temporal Shift Module for Efficient Video U...	2018-11-20	Code
31	S3D	78.1	No	Rethinking Spatiotemporal Feature Learning: Spee...	2017-12-13	Code
32	VoV3D-M (32frames, from scratch, single)	78	No	Diverse Temporal Aggregation and Depthwise Spati...	2020-12-01	Code
33	VoV3D-L (16frames, from scratch, single)	78	No	Diverse Temporal Aggregation and Depthwise Spati...	2020-12-01	Code
34	TSM	77.1	No	TSM: Temporal Shift Module for Efficient Video U...	2018-11-20	Code
35	VoV3D-M (16frames, from scratch, single)	76.9	No	Diverse Temporal Aggregation and Depthwise Spati...	2020-12-01	Code

#1 VideoMAE V2-g · SOTA
91.9
Top 5 Accuracy · Extra Data · 2023-03-29
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking · Code
#2 Side4Video (EVA ViT-E/14)
88.8
Top 5 Accuracy · 2023-11-27
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning · Code
#3 ATM
88.6
Top 5 Accuracy · 2023-07-18
What Can Simple Arithmetic Operations Do for Temporal Modeling? · Code
#4 UniFormerV2-L
88
Top 5 Accuracy · Extra Data
No paper · Code
#5 TDS-CLIP-ViT-L/14(8frames)
87.8
Top 5 Accuracy · 2024-08-20
TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning · Code
#6 UniFormer-B (IN-1K + Kinetics400)
87.3
Top 5 Accuracy
No paper · Code
#7 TRG (ResNet-50) · SOTA
86.1
Top 5 Accuracy · 2019-08-27
Temporal Reasoning Graph for Activity Recognition
#8 UniFormer-B (IN-1K + Kinetics600)
84.9
Top 5 Accuracy
No paper · Code
#9 SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips)
84.4
Top 5 Accuracy · Extra Data · 2021-02-14
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition · Code
#10 BQNEn (ImageNet + K400 pretrained)
84.2
Top 5 Accuracy · 2021-03-29
Busy-Quiet Video Disentangling for Video Classification · Code
#11 TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)
84.1
Top 5 Accuracy · 2020-12-18
TDN: Temporal Difference Networks for Efficient Action Recognition · Code
#12 EAN ResNet50 (single clip, center crop,8+16 ensemble, with sparse Transformer)
83.9
Top 5 Accuracy · 2021-07-22
EAN: Event Adaptive Network for Enhanced Action Recognition · Code
#13 SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip)
83.9
Top 5 Accuracy · Extra Data · 2021-02-14
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition · Code
#14 MSNet-R50En (8+16 ensemble, ImageNet pretrained)
83.8
Top 5 Accuracy · Extra Data · 2020-07-20
MotionSqueeze: Neural Motion Feature Learning for Video Understanding · Code
#15 SELFYNet-TSM-R50 (16 frames, ImageNet pretrained)
82.9
Top 5 Accuracy · Extra Data · 2021-02-14
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition · Code
#16 RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips)
82.8
Top 5 Accuracy · 2021-11-02
Relational Self-Attention: What's Missing in Attention for Video Understanding · Code
#17 PAN ResNet101 (RGB only, no Flow)
82.8
Top 5 Accuracy · 2020-08-08
PAN: Towards Fast Action Recognition via Learning Persistence of Appearance · Code
#18 RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip)
82.6
Top 5 Accuracy · 2021-11-02
Relational Self-Attention: What's Missing in Attention for Video Understanding · Code
#19 VoV3D-L (32frames, Kinetics pretrained, single)
82.3
Top 5 Accuracy · Extra Data · 2020-12-01
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification · Code
#20 MSNet-R50 (16 frames, ImageNet pretrained)
82.3
Top 5 Accuracy · Extra Data · 2020-07-20
MotionSqueeze: Neural Motion Feature Learning for Video Understanding · Code
#21 RNL+TSM Ensemble(R50+R101, ImageNet pretrained)
82.2
Top 5 Accuracy · 2020-07-17
Region-based Non-local Operation for Video Classification · Code
#22 RNL+TSM Ensemble(ResNet50, ImageNet pretrained)
81.5
Top 5 Accuracy · 2020-07-17
Region-based Non-local Operation for Video Classification · Code
#23 TSM+W3 (16 frames, ResNet50)
81.3
Top 5 Accuracy · 2020-04-02
Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention
#24 RSANet-R50 (16 frames, ImageNet pretrained, a single clip)
81.1
Top 5 Accuracy · 2021-11-02
Relational Self-Attention: What's Missing in Attention for Video Understanding · Code
#25 VoV3D-M (32frames, Kinetics pretrained, single)
80.43
Top 5 Accuracy · Extra Data · 2020-12-01
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification · Code
#26 MSNet-R50 (8 frames, ImageNet pretrained)
80.3
Top 5 Accuracy · 2020-07-20
MotionSqueeze: Neural Motion Feature Learning for Video Understanding · Code
#27 RSANet-R50 (8 frames, ImageNet pretrained, a single clip)
79.6
Top 5 Accuracy · 2021-11-02
Relational Self-Attention: What's Missing in Attention for Video Understanding · Code
#28 VoV3D-L (32frames, from scratch, single)
78.7
Top 5 Accuracy · 2020-12-01
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification · Code
#29 S3D-G (ImageNet pretrained) · SOTA
78.7
Top 5 Accuracy · Extra Data · 2017-12-13
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification · Code
#30 TSMEn
78.5
Top 5 Accuracy · 2018-11-20
TSM: Temporal Shift Module for Efficient Video Understanding · Code
#31 S3D
78.1
Top 5 Accuracy · 2017-12-13
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification · Code
#32 VoV3D-M (32frames, from scratch, single)
78
Top 5 Accuracy · 2020-12-01
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification · Code
#33 VoV3D-L (16frames, from scratch, single)
78
Top 5 Accuracy · 2020-12-01
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification · Code
#34 TSM
77.1
Top 5 Accuracy · 2018-11-20
TSM: Temporal Shift Module for Efficient Video Understanding · Code
#35 VoV3D-M (16frames, from scratch, single)
76.9
Top 5 Accuracy · 2020-12-01
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification · Code