Action Recognition on Something-Something V2

Metric: Top-5 Accuracy (higher is better)
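
For readers unfamiliar with the metric, the sketch below shows how top-5 accuracy is conventionally computed: a sample counts as correct when its true label is among the five highest-scoring classes. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not code from any paper listed on this page.

```python
def top_k_accuracy(scores, labels, k=5):
    """Percentage of samples whose true label is among the k top-scoring classes.

    scores: list of per-sample lists of class scores (hypothetical toy input).
    labels: list of true class indices, one per sample.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equally long")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example with 6 classes: class 5 has the lowest score in both samples.
toy_scores = [
    [0.10, 0.20, 0.30, 0.40, 0.50, 0.05],
    [0.10, 0.20, 0.30, 0.40, 0.50, 0.05],
]
toy_labels = [5, 0]  # first sample misses the top 5, second one hits
result = top_k_accuracy(toy_scores, toy_labels, k=5)
```

Published results may also average predictions over several clips or crops before ranking, so numbers from a single-pass computation like this one are not directly comparable across entries.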

Results

#	Model	Top-5 Accuracy	Extra Data	Paper	Date	Code
1	DejaVid	96.3	Yes	-	-	Code
2	VideoMAE V2-g	95.9	Yes	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	2023-03-29	Code
3	MVD (Kinetics400 pretrain, ViT-H, 16 frame)	95.7	Yes	Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning	2022-12-08	Code
4	MVD (Kinetics400 pretrain, ViT-L, 16 frame)	95.5	Yes	Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning	2022-12-08	Code
5	TubeViT-L	95.2	No	Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning	2022-12-06	Code
6	VideoMAE (no extra data, ViT-L, 32x2)	95.2	No	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
7	MaskFeat (Kinetics600 pretrain, MViT-L)	95	Yes	Masked Feature Prediction for Self-Supervised Visual Pre-Training	2021-12-16	Code
8	MAR (50% mask, ViT-L, 16x4)	94.9	No	MAR: Masked Autoencoders for Efficient Action Recognition	2022-07-24	Code
9	VideoMAE (no extra data, ViT-L, 16frame)	94.6	No	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
10	UniFormerV2-L	94.5	Yes	-	-	Code
11	ATM	94.4	No	What Can Simple Arithmetic Operations Do for Temporal Modeling?	2023-07-18	Code
12	MAR (75% mask, ViT-L, 16x4)	94.4	No	MAR: Masked Autoencoders for Efficient Action Recognition	2022-07-24	Code
13	MViTv2-L (IN-21K + Kinetics400 pretrain)	94.1	No	MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	2021-12-02	Code
14	Side4Video (EVA ViT-E/14)	94	No	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	2023-11-27	Code
15	MVD (Kinetics400 pretrain, ViT-B, 16 frame)	94	Yes	Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning	2022-12-08	Code
16	AMD(ViT-B/16)	94	No	Asymmetric Masked Distillation for Pre-Training Small Foundation Models	2023-11-06	-
17	ST-Adapter (ViT-L, CLIP)	93.9	Yes	ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning	2022-06-27	Code
18	TDS-CLIP-ViT-L/14(8frames)	93.8	No	TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning	2024-08-20	Code
19	OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain)	93.5	Yes	Omnivore: A Single Model for Many Visual Modalities	2022-01-20	Code
20	MViTv2-B (IN-21K + Kinetics400 pretrain)	93.4	No	MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	2021-12-02	Code
21	ZeroI2V ViT-L/14	93	Yes	ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video	2023-10-02	Code
22	UniFormer-B (IN-1K + Kinetics400 pretrain)	92.8	Yes	-	-	Code
23	MAR (50% mask, ViT-B, 16x4)	92.8	No	MAR: Masked Autoencoders for Efficient Action Recognition	2022-07-24	Code
24	MVD (Kinetics400 pretrain, ViT-S, 16 frame)	92.8	Yes	Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning	2022-12-08	Code
25	MorphMLP-B (IN-1K)	92.8	Yes	MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning	2021-11-24	Code
26	Swin-B (IN-21K + Kinetics400 pretrain)	92.7	Yes	Video Swin Transformer	2021-06-24	Code
27	MML (ensemble)	92.7	Yes	Mutual Modality Learning for Video Action Classification	2020-11-04	Code
28	CoVeR(JFT-3B)	92.5	Yes	Co-training Transformer with Videos and Images Improves Action Recognition	2021-12-14	-
29	AMD(ViT-S/16)	92.5	No	Asymmetric Masked Distillation for Pre-Training Small Foundation Models	2023-11-06	-
30	VideoMAE (no extra data, ViT-B, 16frame)	92.4	No	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
31	TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only)	92.2	Yes	TDN: Temporal Difference Networks for Efficient Action Recognition	2020-12-18	Code
32	UniFormer-S (IN-1K + Kinetics600 pretrain)	92.1	Yes	-	-	Code
33	CoVeR(JFT-300M)	91.9	Yes	Co-training Transformer with Videos and Images Improves Action Recognition	2021-12-14	-
34	MAR (75% mask, ViT-B, 16x4)	91.9	No	MAR: Masked Autoencoders for Efficient Action Recognition	2022-07-24	Code
35	ILA (ViT-L/14)	91.8	No	Implicit Temporal Modeling with Learnable Alignment for Video Recognition	2023-04-20	Code
36	TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)	91.6	Yes	TDN: Temporal Difference Networks for Efficient Action Recognition	2020-12-18	Code
37	ORViT Mformer-L (ORViT blocks)	91.5	Yes	Object-Region Video Transformers	2021-10-13	Code
38	MViT-B-24, 32x3	91.5	Yes	Multiscale Vision Transformers	2021-04-22	Code
39	TRG (Inception-V3)	91.4	No	Temporal Reasoning Graph for Activity Recognition	2019-08-27	-
40	MViT-B, 32x3(Kinetics600 pretrain)	91.3	Yes	Multiscale Vision Transformers	2021-04-22	Code
41	MML (single)	91.3	Yes	Mutual Modality Learning for Video Action Classification	2020-11-04	Code
42	TSM (RGB + Flow)	91.3	Yes	TSM: Temporal Shift Module for Efficient Video Understanding	2018-11-20	Code
43	Mformer-L	91.2	Yes	Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers	2021-06-09	Code
44	GC-TDN Ensemble (R50,8+16)	91.2	Yes	Group Contextualization for Video Recognition	2022-03-18	Code
45	CT-Net Ensemble (R50, 8+12+16+24)	91.1	Yes	CT-Net: Channel Tensorization Network for Video Classification	2021-06-03	Code
46	SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips)	91.1	Yes	Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	2021-02-14	Code
47	RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips)	91.1	Yes	Relational Self-Attention: What's Missing in Attention for Video Understanding	2021-11-02	Code
48	RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips)	91.1	No	Relational Self-Attention: What's Missing in Attention for Video Understanding	2021-11-02	Code
49	SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip)	91	Yes	Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	2021-02-14	Code
50	PLAR	91	No	SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition	2023-05-21	-
51	RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip)	90.8	Yes	Relational Self-Attention: What's Missing in Attention for Video Understanding	2021-11-02	Code
52	X-Vit (x16)	90.8	Yes	Space-time Mixing Attention for Video Transformer	2021-06-10	Code
53	Mformer-HR	90.6	Yes	Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers	2021-06-09	Code
54	MSNet-R50En (8+16 ensemble, ImageNet pretrained)	90.6	Yes	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	2020-07-20	Code
55	PAN ResNet101 (RGB only, no Flow)	90.6	Yes	PAN: Towards Fast Action Recognition via Learning Persistence of Appearance	2020-08-08	Code
56	ORViT Mformer (ORViT blocks)	90.5	Yes	Object-Region Video Transformers	2021-10-13	Code
57	VoV3D-L (32frames, Kinetics pretrained, single)	90.5	Yes	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code
58	MTV-B	90.4	Yes	Multiview Transformers for Video Recognition	2022-01-12	Code
59	TAdaConvNeXt-T	90.4	Yes	TAda! Temporally-Adaptive Convolutions for Video Understanding	2021-10-12	Code
60	TSM+W3 (16 frames, RGB ResNet-50)	90.4	Yes	Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention	2020-04-02	-
61	ILA (ViT-B/16)	90.3	No	Implicit Temporal Modeling with Learnable Alignment for Video Recognition	2023-04-20	Code
62	TRG (ResNet-50)	90.3	No	Temporal Reasoning Graph for Activity Recognition	2019-08-27	-
63	MViT-B, 16x4	90.2	Yes	Multiscale Vision Transformers	2021-04-22	Code
64	Mformer	90.1	Yes	Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers	2021-06-09	Code
65	TAda2D-En (ResNet-50, 8+16 frames)	89.8	Yes	TAda! Temporally-Adaptive Convolutions for Video Understanding	2021-10-12	Code
66	RSANet-R50 (16 frames, ImageNet pretrained, a single clip)	89.8	Yes	Relational Self-Attention: What's Missing in Attention for Video Understanding	2021-11-02	Code
67	E3D-L	89.8	No	Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition	2023-03-05	Code
68	SELFYNet-TSM-R50 (16 frames, ImageNet pretrained)	89.8	Yes	Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	2021-02-14	Code
69	ViViT-L/16x2 Fact. encoder	89.8	Yes	ViViT: A Video Vision Transformer	2021-03-29	Code
70	STM (16 frames, ImageNet pretraining)	89.8	No	STM: SpatioTemporal and Motion Encoding for Action Recognition	2019-08-07	-
71	VoV3D-L (32frames, from scratch, single)	89.5	No	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code
72	VoV3D-M (32frames, Kinetics pretrained, single)	89.48	Yes	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code
73	MSNet-R50 (16 frames, ImageNet pretrained)	89.4	Yes	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	2020-07-20	Code
74	CCS + two-stream + TRN	89.3	No	Cooperative Cross-Stream Network for Discriminative Action Representation	2019-08-27	-
75	TAda2D (ResNet-50, 16 frames)	89.2	Yes	TAda! Temporally-Adaptive Convolutions for Video Understanding	2021-10-12	Code
76	RSANet-R50 (8 frames, ImageNet pretrained, a single clip)	89.1	Yes	Relational Self-Attention: What's Missing in Attention for Video Understanding	2021-11-02	Code
77	MoViNet-A2	89	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
78	MoViNet-A1	89	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
79	VoV3D-M (32frames, from scratch, single)	88.8	No	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code
80	VoV3D-L (16frames, from scratch, single)	88.6	No	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code
81	MSNet-R50 (8 frames, ImageNet pretrained)	88.4	Yes	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	2020-07-20	Code
82	VoV3D-M (16frames, from scratch, single)	88.2	No	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	2020-12-01	Code
83	MoViNet-A0	88.2	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
84	TAda2D (ResNet-50, 8 frames)	88	Yes	TAda! Temporally-Adaptive Convolutions for Video Understanding	2021-10-12	Code
85	DirecFormer	87.9	No	DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition	2022-03-19	Code
86	OmniVL	86.2	No	OmniVL:One Foundation Model for Image-Language and Video-Language Tasks	2022-09-15	-
87	CPNet Res34, 5 CP	83.95	No	Learning Video Representations from Correspondence Proposals	2019-05-20	Code
88	2-Stream TRN	83.06	No	Temporal Relational Reasoning in Videos	2017-11-22	Code
89	model3D_1 with left-right augmentation and fps jitter	80.46	No	The "something something" video database for learning and evaluating visual common sense	2017-06-13	Code
90	Prob-Distill	79.1	No	Attention Distillation for Learning Video Representations	2019-04-05	-
91	InternVideo2-6B	12	Yes	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024-03-22	Code
