Video on Kinetics-400

Metric: Acc@1 (higher is better)
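
For readers unfamiliar with the metric, the following is a minimal sketch of how top-1 accuracy (Acc@1) is usually computed: the share of test clips whose highest-scoring class equals the ground-truth label, expressed as a percentage. The function name `top1_accuracy` and the toy scores are illustrative assumptions, not taken from the leaderboard. Individual entries may also differ in how many temporal and spatial views they average before taking the argmax, which this sketch does not model.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of samples whose top-scoring class matches the label.

    `scores` holds one row of per-class scores per clip (hypothetical input layout);
    `labels` holds the ground-truth class index for each clip.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example with 3 classes (Kinetics-400 itself has 400 classes).
toy_scores = [
    [0.1, 0.7, 0.2],  # predicted class 1
    [0.6, 0.3, 0.1],  # predicted class 0
    [0.2, 0.2, 0.6],  # predicted class 2
    [0.5, 0.4, 0.1],  # predicted class 0, label is 1, so this one is wrong
]
toy_labels = [1, 0, 2, 1]
print(top1_accuracy(toy_scores, toy_labels))  # 3 of 4 correct
```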

Results

#	Model	Acc@1	Extra Data	Paper	Date	Code
1	OmniVec2	93.6	No	-	-	-
2	FTP-UniFormerV2-L/14	93.4	No	Enhancing Video Transformers for Action Understa...	2024-03-24	-
3	InternVideo2-6B	92.1	Yes	InternVideo2: Scaling Foundation Models for Mult...	2024-03-22	Code
4	InternVideo2-1B	91.6	Yes	InternVideo2: Scaling Foundation Models for Mult...	2024-03-22	Code
5	InternVideo	91.1	No	InternVideo: General Video Foundation Models via...	2022-12-06	Code
6	OmniVec	91.1	No	OmniVec: Learning robust representations with cr...	2023-11-07	-
7	TubeViT-H (ImageNet-1k)	90.9	No	Rethinking Video ViTs: Sparse Video Tubes for Jo...	2022-12-06	Code
8	Unmasked Teacher (ViT-L)	90.6	No	Unmasked Teacher: Towards Training-Efficient Vid...	2023-03-28	Code
9	UMT-L (ViT-L/16)	90.6	No	Unmasked Teacher: Towards Training-Efficient Vid...	2023-03-28	Code
10	TubeViT-L (ImageNet-1k)	90.2	No	Rethinking Video ViTs: Sparse Video Tubes for Jo...	2022-12-06	Code
11	UniFormerV2-L (ViT-L, 336)	90	Yes	-	-	Code
12	VideoMAE V2-g (64x266x266)	90	No	VideoMAE V2: Scaling Video Masked Autoencoders w...	2023-03-29	Code
13	FluxViT-B	90	Yes	Make Your Training Flexible: Towards Deployment-...	2025-03-18	Code
14	MTV-H (WTS 60M)	89.9	No	Multiview Transformers for Video Recognition	2022-01-12	Code
15	TAdaFormer-L/14	89.9	No	Temporally-Adaptive Models for Efficient Video U...	2023-08-10	Code
16	EVA	89.7	No	EVA: Exploring the Limits of Masked Visual Repre...	2022-11-14	Code
17	AM/12 ViT-B Dinov2	89.6	No	AM Flow: Adapters for Temporal Processing in Act...	2024-11-04	-
18	ATM	89.4	No	What Can Simple Arithmetic Operations Do for Tem...	2023-07-18	Code
19	DejaVid	89.1	Yes	-	-	Code
20	CoCa (finetuned)	88.9	No	CoCa: Contrastive Captioners are Image-Text Foun...	2022-05-04	Code
21	BIKE (CLIP ViT-L/14)	88.7	No	Bidirectional Cross-Modal Knowledge Exploration ...	2022-12-31	Code
22	ILA (ViT-L/14)	88.7	No	Implicit Temporal Modeling with Learnable Alignm...	2023-04-20	Code
23	Side4Video (EVA, ViT-E/14)	88.6	No	Side4Video: Spatial-Temporal Side Network for Me...	2023-11-27	Code
24	TubeViT-B (ImageNet-1k)	88.6	No	Rethinking Video ViTs: Sparse Video Tubes for Jo...	2022-12-06	Code
25	VideoMAE V2-g	88.5	Yes	VideoMAE V2: Scaling Video Masked Autoencoders w...	2023-03-29	Code
26	ONE-PEACE	88.1	No	ONE-PEACE: Exploring One General Representation ...	2023-05-18	Code
27	FluxViT-S	88	Yes	Make Your Training Flexible: Towards Deployment-...	2025-03-18	Code
28	CoCa (frozen)	88	No	CoCa: Contrastive Captioners are Image-Text Foun...	2022-05-04	Code
29	ViT-22B	88	No	Scaling Vision Transformers to 22 Billion Parame...	2023-02-10	Code
30	Text4Vis (CLIP ViT-L/14)	87.8	No	Revisiting Classifier: Transferring Vision-Langu...	2022-07-04	Code
31	Hiera-H (no extra data)	87.8	No	Hiera: A Hierarchical Vision Transformer without...	2023-06-01	Code
32	EVL (CLIP ViT-L/14@336px, frozen, 32 frames)	87.7	No	Frozen CLIP Models are Efficient Video Learners	2022-08-06	Code
33	DualPath w/ ViT-L/14	87.7	No	Dual-path Adaptation from Image to Video Transfo...	2023-03-17	Code
34	X-CLIP(ViT-L/14, CLIP)	87.7	No	Expanding Language-Image Pretrained Models for G...	2022-08-04	Code
35	AIM (CLIP ViT-L/14, 32x224)	87.5	Yes	AIM: Adapting Image Models for Efficient Video A...	2023-02-06	Code
36	VideoMAE (no extra data, ViT-H, 32x320x320)	87.4	No	VideoMAE: Masked Autoencoders are Data-Efficient...	2022-03-23	Code
37	ST-Adapter (ViT-L, CLIP)	87.2	No	ST-Adapter: Parameter-Efficient Image-to-Video T...	2022-06-27	Code
38	ZeroI2V ViT-L/14	87.2	No	ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra...	2023-10-02	Code
39	CoVeR (JFT-3B)	87.2	No	Co-training Transformer with Videos and Images I...	2021-12-14	-
40	MVD (K400 pretrain, ViT-H, 16x224x224)	87.2	No	Masked Video Distillation: Rethinking Masked Fea...	2022-12-08	Code
41	mPLUG-2	87.1	No	mPLUG-2: A Modularized Multi-modal Foundation Mo...	2023-02-01	Code
42	MaskFeat (K600, MViT-L)	87	No	Masked Feature Prediction for Self-Supervised Vi...	2021-12-16	Code
43	VicTR (ViT-L/14)	87	No	VicTR: Video-conditioned Text Representations fo...	2023-04-05	-
44	Video-SwinV2-G (ImageNet-22k and external 70M pretrain)	86.8	No	Swin Transformer V2: Scaling Up Capacity and Res...	2021-11-18	Code
45	MaskFeat (no extra data, MViT-L)	86.7	No	Masked Feature Prediction for Self-Supervised Vi...	2021-12-16	Code
46	VideoMAE (no extra data, ViT-H)	86.6	No	VideoMAE: Masked Autoencoders are Data-Efficient...	2022-03-23	Code
47	MVD (K400 pretrain, ViT-L, 16x224x224)	86.4	No	Masked Video Distillation: Rethinking Masked Fea...	2022-12-08	Code
48	TAdaConvNeXtV2-B	86.4	No	Temporally-Adaptive Models for Efficient Video U...	2023-08-10	Code
49	CoVeR (JFT-300M)	86.3	No	Co-training Transformer with Videos and Images I...	2021-12-14	-
50	VideoMAE (no extra data, ViT-L, 32x320x320)	86.1	No	VideoMAE: Masked Autoencoders are Data-Efficient...	2022-03-23	Code
51	MViTv2-L (ImageNet-21k pretrain)	86.1	Yes	MViTv2: Improved Multiscale Vision Transformers ...	2021-12-02	Code
52	ILA (ViT-B/16)	85.7	No	Implicit Temporal Modeling with Learnable Alignm...	2023-04-20	Code
53	DualPath w/ ViT-B/16	85.4	No	Dual-path Adaptation from Image to Video Transfo...	2023-03-17	Code
54	TokenLearner 16at18 (L/10)	85.4	No	TokenLearner: What Can 8 Learned Tokens Do for I...	2021-06-21	Code
55	MAR (50% mask, ViT-L, 16x4)	85.3	No	MAR: Masked Autoencoders for Efficient Action Re...	2022-07-24	Code
56	CAST(ViT-B/16)	85.3	No	CAST: Cross-Attention in Space and Time for Vide...	2023-11-30	Code
57	VideoMAE (no extra data, ViT-L, 16x4)	85.2	No	VideoMAE: Masked Autoencoders are Data-Efficient...	2022-03-23	Code
58	ViC-MAE (ViT-L)	85.1	No	ViC-MAE: Self-Supervised Representation Learning...	2023-03-21	Code
59	VideoMamba-M800	85	No	VideoMamba: State Space Model for Efficient Vide...	2024-03-11	Code
60	Swin-L (384x384, ImageNet-21k pretrain)	84.9	No	Video Swin Transformer	2021-06-24	Code
61	ViViT-H/16x2 (JFT)	84.9	No	ViViT: A Video Vision Transformer	2021-03-29	Code
62	OMNIVORE (Swin-L)	84.1	No	Omnivore: A Single Model for Many Visual Modalit...	2022-01-20	Code
63	OMNIVORE (Swin-B)	84	No	Omnivore: A Single Model for Many Visual Modalit...	2022-01-20	Code
64	MAR (75% mask, ViT-L, 16x4)	83.9	No	MAR: Masked Autoencoders for Efficient Action Re...	2022-07-24	Code
65	ActionCLIP (CLIP-pretrained)	83.8	No	ActionCLIP: A New Paradigm for Video Action Reco...	2021-09-17	Code
66	OmniSource irCSN-152 (IG-Kinetics-65M pretrain)	83.6	No	Omni-sourced Webly-supervised Learning for Video...	2020-03-29	Code
67	MVD (K400 pretrain, ViT-B, 16x224x224)	83.4	No	Masked Video Distillation: Rethinking Masked Fea...	2022-12-08	Code
68	StructViT-B-4-1	83.4	No	Learning Correlation Structures for Vision Trans...	2024-04-05	-
69	Swin-L (ImageNet-21k pretrain)	83.1	No	Video Swin Transformer	2021-06-24	Code
70	SIFA	83.1	No	Stand-Alone Inter-Frame Attention in Video Models	2022-06-14	Code
71	UniFormer-B (ImageNet-1K)	82.9	No	-	-	Code
72	irCSN-152 (IG-Kinetics-65M pretrain)	82.8	No	Large-scale weakly-supervised pre-training for v...	2019-05-02	Code
73	DirecFormer	82.75	No	DirecFormer: A Directed Attention in Transformer...	2022-03-19	Code
74	Swin-B (ImageNet-21k pretrain)	82.7	No	Video Swin Transformer	2021-06-24	Code
75	ir-CSN-152 (IG-65M pretraining)	82.6	No	Video Classification with Channel-Separated Conv...	2019-04-04	Code
76	ip-CSN-152 (IG-65M pretraining)	82.5	No	Video Classification with Channel-Separated Conv...	2019-04-04	Code
77	TPS	82.5	No	Spatiotemporal Self-attention Modeling with Temp...	2022-07-27	Code
78	ILA (ViT-B/32)	82.4	No	Implicit Temporal Modeling with Learnable Alignm...	2023-04-20	Code
79	AMD(ViT-B/16)	82.2	No	Asymmetric Masked Distillation for Pre-Training ...	2023-11-06	-
80	VATT-Large	82.1	No	VATT: Transformers for Multimodal Self-Supervise...	2021-04-22	Code
81	AdaMAE	81.7	No	AdaMAE: Adaptive Masking for Efficient Spatiotem...	2022-11-16	Code
82	VideoMAE (no extra data, ViT-B, 16x4)	81.5	No	VideoMAE: Masked Autoencoders are Data-Efficient...	2022-03-23	Code
83	MoViNet-A6	81.5	No	MoViNets: Mobile Video Networks for Efficient Vi...	2021-03-21	Code
84	MLP-3D	81.4	No	MLP-3D: A MLP-like 3D Architecture with Grouped ...	2022-06-13	-
85	R[2+1]D-152 (IG-65M pretraining)	81.3	No	Video Classification with Channel-Separated Conv...	2019-04-04	Code
86	LGD-3D Two-stream (ResNet-101)	81.2	No	Learning Spatio-Temporal Representation with Loc...	2019-06-13	-
87	MViT-B, 64x3	81.2	No	Multiscale Vision Transformers	2021-04-22	Code
88	Motionformer-HR	81.1	No	Keeping Your Eye on the Ball: Trajectory Attenti...	2021-06-09	Code
89	MVD (K400 pretrain, ViT-S, 16x224x224)	81	No	Masked Video Distillation: Rethinking Masked Fea...	2022-12-08	Code
90	MAR (50% mask, ViT-B, 16x4)	81	No	MAR: Masked Autoencoders for Efficient Action Re...	2022-07-24	Code
91	MoViNet-A5	80.9	No	MoViNets: Mobile Video Networks for Efficient Vi...	2021-03-21	Code
92	MBT (AV)	80.8	No	Attention Bottlenecks for Multimodal Fusion	2021-06-30	Code
93	TimeSformer-L	80.7	No	Is Space-Time Attention All You Need for Video U...	2021-02-09	Code
94	Swin-B (ImageNet-1k pretrain)	80.6	No	Video Swin Transformer	2021-06-24	Code
95	Swin-S (ImageNet-1k pretrain)	80.6	No	Video Swin Transformer	2021-06-24	Code
96	En-VidTr-L	80.5	No	VidTr: Video Transformer Without Convolutions	2021-04-23	-
97	MoViNet-A4	80.5	No	MoViNets: Mobile Video Networks for Efficient Vi...	2021-03-21	Code
98	OmniSource SlowOnly R101 8x8(ImageNet pretrain)	80.5	No	Omni-sourced Webly-supervised Learning for Video...	2020-03-29	Code
99	STAM (64 Frames)	80.5	No	An Image is Worth 16x16 Words, What is a Video W...	2021-03-25	Code
100	X3D-XXL	80.4	No	X3D: Expanding Architectures for Efficient Video...	2020-04-09	Code
101	R3D-RS-200	80.4	No	Revisiting 3D ResNets for Video Recognition	2021-09-03	Code
102	OmniSource SlowOnly R101 8x8 (Scratch)	80.4	No	Omni-sourced Webly-supervised Learning for Video...	2020-03-29	Code
103	MViT-B, 32x3	80.2	No	Multiscale Vision Transformers	2021-04-22	Code
104	AMD(ViT-S/16)	80.1	No	Asymmetric Masked Distillation for Pre-Training ...	2023-11-06	-
105	SlowFast 16x8 (ResNet-101 + NL)	79.8	Yes	SlowFast Networks for Video Recognition	2018-12-10	Code
106	CT-Net Ensemble	79.8	No	CT-Net: Channel Tensorization Network for Video ...	2021-06-03	Code
107	ViT-B-VTN+ ImageNet-21K (84.0 [10])	79.8	No	Video Transformer Network	2021-02-01	Code
108	TimeSformer-HR	79.7	No	Is Space-Time Attention All You Need for Video U...	2021-02-09	Code
109	En-VidTr-M	79.7	No	VidTr: Video Transformer Without Convolutions	2021-04-23	-
110	LGD-3D RGB (ResNet-101)	79.4	No	Learning Spatio-Temporal Representation with Loc...	2019-06-13	-
111	TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only)	79.4	No	TDN: Temporal Difference Networks for Efficient ...	2020-12-18	Code
112	En-VidTr-S	79.4	No	VidTr: Video Transformer Without Convolutions	2021-04-23	-
113	MAR (75% mask, ViT-B, 16x4)	79.4	No	MAR: Masked Autoencoders for Efficient Action Re...	2022-07-24	Code
114	STAM (16 Frames)	79.3	No	An Image is Worth 16x16 Words, What is a Video W...	2021-03-25	Code
115	ip-CSN-152 (Sports-1M pretraining)	79.2	No	Video Classification with Channel-Separated Conv...	2019-04-04	Code
116	CorrNet	79.2	No	Video Modeling with Correlation Networks	2019-06-07	-
117	OmniVL	79.1	No	OmniVL: One Foundation Model for Image-Language a...	2022-09-15	-
118	X3D-XL	79.1	No	X3D: Expanding Architectures for Efficient Video...	2020-04-09	Code
119	MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only)	79.1	No	MVFNet: Multi-View Fusion Network for Efficient ...	2020-12-13	Code
120	TAdaConvNeXt-T	79.1	No	TAda! Temporally-Adaptive Convolutions for Video...	2021-10-12	Code
121	SlowFast 16x8 (ResNet-101)	78.9	No	SlowFast Networks for Video Recognition	2018-12-10	Code
122	G-Blend (Sports-1M pretrain)	78.9	No	What Makes Training Multi-Modal Classification N...	2019-05-29	Code
123	Swin-T (ImageNet-1k pretrain)	78.8	No	Video Swin Transformer	2021-06-24	Code
124	GB + DF + LB (ResNet 152, ImageNet pretrained)	78.8	No	Action recognition with spatial-temporal discrim...	2019-08-20	-
125	ViT-B-VTN (3 layers, ImageNet pretrain)	78.6	No	Video Transformer Network	2021-02-01	Code
126	MViT-B, 16x4	78.4	No	Multiscale Vision Transformers	2021-04-22	Code
127	MoViNet-A3	78.2	No	MoViNets: Mobile Video Networks for Efficient Vi...	2021-03-21	Code
128	TAda2D-En (ResNet-50, 8+16 frames)	78.2	No	TAda! Temporally-Adaptive Convolutions for Video...	2021-10-12	Code
129	SVT	78.1	No	Self-supervised Video Transformer	2021-12-02	Code
130	TimeSformer	78	No	Is Space-Time Attention All You Need for Video U...	2021-02-09	Code
131	SlowFast 8x8 (ResNet-101)	77.9	No	SlowFast Networks for Video Recognition	2018-12-10	Code
132	RepFlow-50 ([2+1]D CNN, FcF, Non-local block)	77.9	No	Representation Flow for Action Recognition	2018-10-02	Code
133	ip-CSN-152	77.8	No	Video Classification with Channel-Separated Conv...	2019-04-04	Code
134	I3D + NL	77.7	No	Non-local Neural Networks	2017-11-21	Code
135	G-Blend	77.7	No	What Makes Training Multi-Modal Classification N...	2019-05-29	Code
136	HATNet (32 frames)	77.6	No	Large Scale Holistic Video Understanding	2019-04-25	Code
137	X3D-L	77.5	No	X3D: Expanding Architectures for Efficient Video...	2020-04-09	Code
138	CoST ResNet-101 (ImageNet pretrain)	77.5	No	-	-	Code
139	TAda2D (ResNet-50, 16 frames)	77.4	No	TAda! Temporally-Adaptive Convolutions for Video...	2021-10-12	Code
140	EvaNet	77.4	No	Evolving Space-Time Neural Architectures for Vid...	2018-11-26	-
141	RNL+TSM Ensemble(ResNet50, 8 + 16 frames)	77.4	No	Region-based Non-local Operation for Video Class...	2020-07-17	Code
142	VIMPAC	77.4	No	VIMPAC: Video Pre-Training via Masked Token Pred...	2021-06-21	Code
143	BQN (ResNet-50)	77.3	No	Busy-Quiet Video Disentangling for Video Classif...	2021-03-29	Code
144	S3D-G (RGB+Flow, ImageNet pretrained)	77.2	No	Rethinking Spatiotemporal Feature Learning: Spee...	2017-12-13	Code
145	SlowFast 8x8 (ResNet-50)	77	No	SlowFast Networks for Video Recognition	2018-12-10	Code
146	TAda2D (ResNet-50, 8 frames)	76.7	No	TAda! Temporally-Adaptive Convolutions for Video...	2021-10-12	Code
147	D3D+S3D-G (RGB + RGB)	76.5	No	D3D: Distilled 3D Networks for Video Action Reco...	2018-12-19	Code
148	MSNet-R50 (16 frames, ImageNet pretrained)	76.4	No	MotionSqueeze: Neural Motion Feature Learning fo...	2020-07-20	Code
149	GloRe	76.1	No	Global Textual Relation Embedding for Relational...	2019-06-03	Code
150	X3D-M	76	No	X3D: Expanding Architectures for Efficient Video...	2020-04-09	Code
151	MViT-S	76	No	Multiscale Vision Transformers	2021-04-22	Code
152	CMA iter1 (16 frames)	75.98	No	Two-Stream Video Classification with Cross-Modal...	2019-08-01	-
153	D3D (RGB)	75.9	No	D3D: Distilled 3D Networks for Video Action Reco...	2018-12-19	Code
154	Oct-I3D + NL	75.7	No	Drop an Octave: Reducing Spatial Redundancy in C...	2019-04-10	Code
155	SlowFast 4x16 (ResNet-50)	75.6	No	SlowFast Networks for Video Recognition	2018-12-10	Code
156	R[2+1]D-Flow (Sports-1M pretrain)	75.4	No	A Closer Look at Spatiotemporal Convolutions for...	2017-11-30	Code
157	FASTER32	75.1	No	FASTER Recurrent Networks for Efficient Video Cl...	2019-06-10	-
158	MoViNet-A2	75	No	MoViNets: Mobile Video Networks for Efficient Vi...	2021-03-21	Code
159	MARS+RGB+Flow (64 frames)	74.9	No	-	-	Code
160	S3D-G (RGB, ImageNet pretrained)	74.7	No	Rethinking Spatiotemporal Feature Learning: Spee...	2017-12-13	Code
161	TSM	74.7	No	TSM: Temporal Shift Module for Efficient Video U...	2018-11-20	Code
162	A2 Net	74.6	No	$A^2$-Nets: Double Attention Networks	2018-10-27	-
163	R[2+1]D-RGB (Sports-1M pretrain)	74.3	No	A Closer Look at Spatiotemporal Convolutions for...	2017-11-30	Code
164	TSN	73.9	No	ConvNet Architecture Search for Spatiotemporal F...	2017-08-16	Code
165	R[2+1]D-Two-Stream	73.9	No	A Closer Look at Spatiotemporal Convolutions for...	2017-11-30	Code
166	TSN	73.9	No	ConvNet Architecture Search for Spatiotemporal F...	2017-08-16	Code
167	STM (ResNet-50)	73.7	No	STM: SpatioTemporal and Motion Encoding for Acti...	2019-08-07	-
168	bLVNet Fan et al. (2019)	73.5	No	More Is Less: Learning Efficient Video Represent...	2019-12-02	Code
169	Co Slow_64	73.05	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
170	Inception-ResNet	73	No	Revisiting the Effectiveness of Off-the-shelf Te...	2017-08-12	-
171	MFNet	72.8	No	Multi-Fiber Networks for Video Recognition	2018-07-30	-
172	MoViNet-A1	72.7	No	MoViNets: Mobile Video Networks for Efficient Vi...	2021-03-21	Code
173	ARTNet	72.4	No	Appearance-and-Relation Networks for Video Class...	2017-11-24	Code
174	LGD-3D Flow (ResNet-101)	72.3	No	Learning Spatio-Temporal Representation with Loc...	2019-06-13	-
175	R[2+1]D	72	No	A Closer Look at Spatiotemporal Convolutions for...	2017-11-30	Code
176	R[2+1]D-RGB	72	No	A Closer Look at Spatiotemporal Convolutions for...	2017-11-30	Code
177	FASTER16 w/o sp	71.7	No	FASTER Recurrent Networks for Efficient Video Cl...	2019-06-10	-
178	Co X3D-L_64	71.61	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
179	I3D	71.1	No	Quo Vadis, Action Recognition? A New Model and t...	2017-05-22	Code
180	Co X3D-M_64	71.03	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
181	X3D-L	69.29	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
182	MARS+RGB+Flow (16 frames)	68.9	No	-	-	Code
183	SlowFast-8×8-R50	68.45	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
184	S3D-G (Flow, ImageNet pretrained)	68	No	Rethinking Spatiotemporal Feature Learning: Spee...	2017-12-13	Code
185	R[2+1]D-Flow	67.5	No	A Closer Look at Spatiotemporal Convolutions for...	2017-11-30	Code
186	Slow-8x8-R50	67.42	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
187	Co X3D-S_64	67.33	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
188	X3D-M	67.24	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
189	SlowFast-4×16-R50	67.06	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
190	Co Slow_8	65.9	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
191	MoViNet-A0	65.8	No	MoViNets: Mobile Video Networks for Efficient Vi...	2021-03-21	Code
192	X3D-S	64.71	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
193	I3D-R50	63.98	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
194	Co X3D-L_16	63.03	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
195	Co X3D-M_16	62.8	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
196	Co X3D-S_13	60.18	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
197	Co I3D_8	59.58	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
198	R(2+1)D-18_16	59.52	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
199	X3D-XS	59.37	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
200	Co I3D_64	56.86	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
201	R(2+1)D-18_8	53.52	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code
202	RCU_8	53.4	No	Continual 3D Convolutional Neural Networks for R...	2021-05-31	Code

#1OmniVec2
93.6
Acc@1
No paper
#2FTP-UniFormerV2-L/14SOTA
93.4
Acc@1· 2024-03-24
Enhancing Video Transformers for Action Understanding with VLM-aided Training
#3InternVideo2-6BSOTA
92.1
Acc@1· Extra Data· 2024-03-22
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding Code
#4InternVideo2-1B
91.6
Acc@1· Extra Data· 2024-03-22
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding Code
#5InternVideoSOTA
91.1
Acc@1· 2022-12-06
InternVideo: General Video Foundation Models via Generative and Discriminative Learning Code
#6OmniVec
91.1
Acc@1· 2023-11-07
OmniVec: Learning robust representations with cross modal sharing
#7TubeViT-H (ImageNet-1k)
90.9
Acc@1· 2022-12-06
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning Code
#8Unmasked Teacher (ViT-L)
90.6
Acc@1· 2023-03-28
Unmasked Teacher: Towards Training-Efficient Video Foundation Models Code
#9UMT-L (ViT-L/16)
90.6
Acc@1· 2023-03-28
Unmasked Teacher: Towards Training-Efficient Video Foundation Models Code
#10TubeVit-L (ImageNet-1k)
90.2
Acc@1· 2022-12-06
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning Code
#11UniFormerV2-L (ViT-L, 336)
90
Acc@1· Extra Data
No paperCode
#12VideoMAE V2-g (64x266x266)
90
Acc@1· 2023-03-29
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking Code
#13FluxViT-B
90
Acc@1· Extra Data· 2025-03-18
Make Your Training Flexible: Towards Deployment-Efficient Video Models Code
#14MTV-H (WTS 60M)SOTA
89.9
Acc@1· 2022-01-12
Multiview Transformers for Video Recognition Code
#15TAdaFormer-L/14
89.9
Acc@1· 2023-08-10
Temporally-Adaptive Models for Efficient Video Understanding Code
#16EVA
89.7
Acc@1· 2022-11-14
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale Code
#17AM/12 ViT-B Dinov2
89.6
Acc@1· 2024-11-04
AM Flow: Adapters for Temporal Processing in Action Recognition
#18ATM
89.4
Acc@1· 2023-07-18
What Can Simple Arithmetic Operations Do for Temporal Modeling?Code
#19DejaVid
89.1
Acc@1· Extra Data
No paperCode
#20CoCa (finetuned)
88.9
Acc@1· 2022-05-04
CoCa: Contrastive Captioners are Image-Text Foundation Models Code
#21BIKE (CLIP ViT-L/14)
88.7
Acc@1· 2022-12-31
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models Code
#22ILA (ViT-L/14)
88.7
Acc@1· 2023-04-20
Implicit Temporal Modeling with Learnable Alignment for Video Recognition Code
#23Side4Video (EVA, ViT-E/14)
88.6
Acc@1· 2023-11-27
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning Code
#24TubeVit-B (ImageNet-1k)
88.6
Acc@1· 2022-12-06
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning Code
#25VideoMAE V2-g
88.5
Acc@1· Extra Data· 2023-03-29
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking Code
#26ONE-PEACE
88.1
Acc@1· 2023-05-18
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities Code
#27FluxViT-S
88
Acc@1· Extra Data· 2025-03-18
Make Your Training Flexible: Towards Deployment-Efficient Video Models Code
#28CoCa (frozen)
88
Acc@1· 2022-05-04
CoCa: Contrastive Captioners are Image-Text Foundation Models Code
#29ViT-22B
88
Acc@1· 2023-02-10
Scaling Vision Transformers to 22 Billion Parameters Code
#30Text4Vis (CLIP ViT-L/14)
87.8
Acc@1· 2022-07-04
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition Code
#31Hiera-H (no extra data)
87.8
Acc@1· 2023-06-01
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles Code
#32EVL (CLIP ViT-L/14@336px, frozen, 32 frames)
87.7
Acc@1· 2022-08-06
Frozen CLIP Models are Efficient Video Learners Code
#33DualPath w/ ViT-L/14
87.7
Acc@1· 2023-03-17
Dual-path Adaptation from Image to Video Transformers Code
#34X-CLIP(ViT-L/14, CLIP)
87.7
Acc@1· 2022-08-04
Expanding Language-Image Pretrained Models for General Video Recognition Code
#35AIM (CLIP ViT-L/14, 32x224)
87.5
Acc@1· Extra Data· 2023-02-06
AIM: Adapting Image Models for Efficient Video Action Recognition Code
#36VideoMAE (no extra data, ViT-H, 32x320x320)
87.4
Acc@1· 2022-03-23
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training Code
#37ST-Adapter (ViT-L, CLIP)
87.2
Acc@1· 2022-06-27
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning Code
#38ZeroI2V ViT-L/14
87.2
Acc@1· 2023-10-02
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video Code
#39CoVeR (JFT-3B)SOTA
87.2
Acc@1· 2021-12-14
Co-training Transformer with Videos and Images Improves Action Recognition
#40MVD (K400 pretrain, ViT-H, 16x224x224)
87.2
Acc@1· 2022-12-08
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning Code
#41mPLUG-2
87.1
Acc@1· 2023-02-01
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video Code
#42MaskFeat (K600, MViT-L)
87
Acc@1· 2021-12-16
Masked Feature Prediction for Self-Supervised Visual Pre-Training Code
#43VicTR (ViT-L/14)
87
Acc@1· 2023-04-05
VicTR: Video-conditioned Text Representations for Activity Recognition
#44Video-SwinV2-G (ImageNet-22k and external 70M pretrain)SOTA
86.8
Acc@1· 2021-11-18
Swin Transformer V2: Scaling Up Capacity and Resolution Code
#45MaskFeat (no extra data, MViT-L)
86.7
Acc@1· 2021-12-16
Masked Feature Prediction for Self-Supervised Visual Pre-Training Code
#46VideoMAE (no extra data, ViT-H)
86.6
Acc@1· 2022-03-23
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training Code
#47MVD (K400 pretrain, ViT-L, 16x224x224)
86.4
Acc@1· 2022-12-08
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning Code
#48TAdaConvNeXtV2-B
86.4
Acc@1· 2023-08-10
Temporally-Adaptive Models for Efficient Video Understanding Code
#49CoVeR (JFT-300M)
86.3
Acc@1· 2021-12-14
Co-training Transformer with Videos and Images Improves Action Recognition
#50VideoMAE (no extra data, ViT-L, 32x320x320)
86.1
Acc@1· 2022-03-23
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training Code
#51MViTv2-L (ImageNet-21k pretrain)
86.1
Acc@1· Extra Data· 2021-12-02
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection Code
#52ILA (ViT-B/16)
85.7
Acc@1· 2023-04-20
Implicit Temporal Modeling with Learnable Alignment for Video Recognition Code
#53DualPath w/ ViT-B/16
85.4
Acc@1· 2023-03-17
Dual-path Adaptation from Image to Video Transformers Code
#54TokenLearner 16at18 (L/10)SOTA
85.4
Acc@1· 2021-06-21
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?Code
#55MAR (50% mask, ViT-L, 16x4)
85.3
Acc@1· 2022-07-24
MAR: Masked Autoencoders for Efficient Action Recognition Code
#56CAST(ViT-B/16)
85.3
Acc@1· 2023-11-30
CAST: Cross-Attention in Space and Time for Video Action Recognition Code
#57VideoMAE (no extra data, ViT-L, 16x4)
85.2
Acc@1· 2022-03-23
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training Code
#58ViC-MAE (ViT-L)
85.1
Acc@1· 2023-03-21
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders Code
#59VideoMamba-M800
85
Acc@1· 2024-03-11
VideoMamba: State Space Model for Efficient Video Understanding Code
#60Swin-L (384x384, ImageNet-21k pretrain)
84.9
Acc@1· 2021-06-24
Video Swin Transformer Code
#61ViViT-H/16x2 (JFT)SOTA
84.9
Acc@1· 2021-03-29
ViViT: A Video Vision Transformer Code
#62OMNIVORE (Swin-L)
84.1
Acc@1· 2022-01-20
Omnivore: A Single Model for Many Visual Modalities Code
#63OMNIVORE (Swin-B)
84
Acc@1· 2022-01-20
Omnivore: A Single Model for Many Visual Modalities Code
#64MAR (75% mask, ViT-L, 16x4)
83.9
Acc@1· 2022-07-24
MAR: Masked Autoencoders for Efficient Action Recognition Code
#65ActionCLIP (CLIP-pretrained)
83.8
Acc@1· 2021-09-17
ActionCLIP: A New Paradigm for Video Action Recognition Code
#66OmniSource irCSN-152 (IG-Kinetics-65M pretrain)SOTA
83.6
Acc@1· 2020-03-29
Omni-sourced Webly-supervised Learning for Video Recognition Code
#67MVD (K400 pretrain, ViT-B, 16x224x224)
83.4
Acc@1· 2022-12-08
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning Code
#68StructViT-B-4-1
83.4
Acc@1· 2024-04-05
Learning Correlation Structures for Vision Transformers
#69Swin-L (ImageNet-21k pretrain)
83.1
Acc@1· 2021-06-24
Video Swin Transformer Code
#70SIFA
83.1
Acc@1· 2022-06-14
Stand-Alone Inter-Frame Attention in Video Models Code
#71UniFormer-B (ImageNet-1K)
82.9
Acc@1
No paperCode
#72irCSN-152 (IG-Kinetics-65M pretrain)SOTA
82.8
Acc@1· 2019-05-02
Large-scale weakly-supervised pre-training for video action recognition Code
#73DirecFormer
82.75
Acc@1· 2022-03-19
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition Code
#74Swin-B (ImageNet-21k pretrain)
82.7
Acc@1· 2021-06-24
Video Swin Transformer Code
#75ir-CSN-152 (IG-65M pretraining)SOTA
82.6
Acc@1· 2019-04-04
Video Classification with Channel-Separated Convolutional Networks Code
#76ip-CSN-152 (IG-65M pretraining)
82.5
Acc@1· 2019-04-04
Video Classification with Channel-Separated Convolutional Networks Code
#77TPS
82.5
Acc@1· 2022-07-27
Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition Code
#78ILA (ViT-B/32)
82.4
Acc@1· 2023-04-20
Implicit Temporal Modeling with Learnable Alignment for Video Recognition Code
#79AMD(ViT-B/16)
82.2
Acc@1· 2023-11-06
Asymmetric Masked Distillation for Pre-Training Small Foundation Models
#80VATT-Large
82.1
Acc@1· 2021-04-22
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text Code
#81AdaMAE
81.7
Acc@1· 2022-11-16
AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders Code
#82VideoMAE (no extra data, ViT-B, 16x4)
81.5
Acc@1· 2022-03-23
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training Code
#83MoViNet-A6
81.5
Acc@1· 2021-03-21
MoViNets: Mobile Video Networks for Efficient Video Recognition Code
#84MLP-3D
81.4
Acc@1· 2022-06-13
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
#85R[2+1]D-152 (IG-65M pretraining)
81.3
Acc@1· 2019-04-04
Video Classification with Channel-Separated Convolutional Networks Code
#86LGD-3D Two-stream (ResNet-101)
81.2
Acc@1· 2019-06-13
Learning Spatio-Temporal Representation with Local and Global Diffusion
#87MViT-B, 64x3
81.2
Acc@1· 2021-04-22
Multiscale Vision Transformers Code
#88Motionformer-HR
81.1
Acc@1· 2021-06-09
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers Code
#89MVD (K400 pretrain, ViT-S, 16x224x224)
81
Acc@1· 2022-12-08
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning Code
#90MAR (50% mask, ViT-B, 16x4)
81
Acc@1· 2022-07-24
MAR: Masked Autoencoders for Efficient Action Recognition Code
#91MoViNet-A5
80.9
Acc@1· 2021-03-21
MoViNets: Mobile Video Networks for Efficient Video Recognition Code
#92MBT (AV)
80.8
Acc@1· 2021-06-30
Attention Bottlenecks for Multimodal Fusion Code
#93TimeSformer-L
80.7
Acc@1· 2021-02-09
Is Space-Time Attention All You Need for Video Understanding?Code
#94Swin-B (ImageNet-1k pretrain)
80.6
Acc@1· 2021-06-24
Video Swin Transformer Code
#95Swin-S (ImageNet-1k pretrain)
80.6
Acc@1· 2021-06-24
Video Swin Transformer Code
#96En-VidTr-L
80.5
Acc@1· 2021-04-23
VidTr: Video Transformer Without Convolutions
#97MoViNet-A4
80.5
Acc@1· 2021-03-21
MoViNets: Mobile Video Networks for Efficient Video Recognition Code
#98OmniSource SlowOnly R101 8x8(ImageNet pretrain)
80.5
Acc@1· 2020-03-29
Omni-sourced Webly-supervised Learning for Video Recognition Code
#99STAM (64 Frames)
80.5
Acc@1· 2021-03-25
An Image is Worth 16x16 Words, What is a Video Worth?Code
#100X3D-XXL
80.4
Acc@1· 2020-04-09
X3D: Expanding Architectures for Efficient Video Recognition Code
#101R3D-RS-200
80.4
Acc@1· 2021-09-03
Revisiting 3D ResNets for Video Recognition Code
#102OmniSource SlowOnly R101 8x8 (Scratch)
80.4
Acc@1· 2020-03-29
Omni-sourced Webly-supervised Learning for Video Recognition Code
#103MViT-B, 32x3
80.2
Acc@1· 2021-04-22
Multiscale Vision Transformers Code
#104AMD(ViT-S/16)
80.1
Acc@1· 2023-11-06
Asymmetric Masked Distillation for Pre-Training Small Foundation Models
#105SlowFast 16x8 (ResNet-101 + NL)SOTA
79.8
Acc@1· Extra Data· 2018-12-10
SlowFast Networks for Video Recognition Code
#106CT-Net Ensemble
79.8
Acc@1· 2021-06-03
CT-Net: Channel Tensorization Network for Video Classification Code
#107ViT-B-VTN+ ImageNet-21K (84.0 [10])
79.8
Acc@1· 2021-02-01
Video Transformer Network Code
#108TimeSformer-HR
79.7
Acc@1· 2021-02-09
Is Space-Time Attention All You Need for Video Understanding?Code
#109En-VidTr-M
79.7
Acc@1· 2021-04-23
VidTr: Video Transformer Without Convolutions
#110LGD-3D RGB (ResNet-101)
79.4
Acc@1· 2019-06-13
Learning Spatio-Temporal Representation with Local and Global Diffusion
#111TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only)
79.4
Acc@1· 2020-12-18
TDN: Temporal Difference Networks for Efficient Action Recognition Code
#112En-VidTr-S
79.4
Acc@1· 2021-04-23
VidTr: Video Transformer Without Convolutions
#113MAR (75% mask, ViT-B, 16x4)
79.4
Acc@1· 2022-07-24
MAR: Masked Autoencoders for Efficient Action Recognition Code
#114STAM (16 Frames)
79.3
Acc@1· 2021-03-25
An Image is Worth 16x16 Words, What is a Video Worth? Code
#115ip-CSN-152 (Sports-1M pretraining)
79.2
Acc@1· 2019-04-04
Video Classification with Channel-Separated Convolutional Networks Code
#116CorrNet
79.2
Acc@1· 2019-06-07
Video Modeling with Correlation Networks
#117OmniVL
79.1
Acc@1· 2022-09-15
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
#118X3D-XL
79.1
Acc@1· 2020-04-09
X3D: Expanding Architectures for Efficient Video Recognition Code
#119MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only)
79.1
Acc@1· 2020-12-13
MVFNet: Multi-View Fusion Network for Efficient Video Recognition Code
#120TAdaConvNeXt-T
79.1
Acc@1· 2021-10-12
TAda! Temporally-Adaptive Convolutions for Video Understanding Code
#121SlowFast 16x8 (ResNet-101)
78.9
Acc@1· 2018-12-10
SlowFast Networks for Video Recognition Code
#122G-Blend (Sports-1M pretrain)
78.9
Acc@1· 2019-05-29
What Makes Training Multi-Modal Classification Networks Hard? Code
#123Swin-T (ImageNet-1k pretrain)
78.8
Acc@1· 2021-06-24
Video Swin Transformer Code
#124GB + DF + LB (ResNet 152, ImageNet pretrained)
78.8
Acc@1· 2019-08-20
Action recognition with spatial-temporal discriminative filter banks
#125ViT-B-VTN (3 layers, ImageNet pretrain)
78.6
Acc@1· 2021-02-01
Video Transformer Network Code
#126MViT-B, 16x4
78.4
Acc@1· 2021-04-22
Multiscale Vision Transformers Code
#127MoViNet-A3
78.2
Acc@1· 2021-03-21
MoViNets: Mobile Video Networks for Efficient Video Recognition Code
#128TAda2D-En (ResNet-50, 8+16 frames)
78.2
Acc@1· 2021-10-12
TAda! Temporally-Adaptive Convolutions for Video Understanding Code
#129SVT
78.1
Acc@1· 2021-12-02
Self-supervised Video Transformer Code
#130TimeSformer
78
Acc@1· 2021-02-09
Is Space-Time Attention All You Need for Video Understanding? Code
#131SlowFast 8x8 (ResNet-101)
77.9
Acc@1· 2018-12-10
SlowFast Networks for Video Recognition Code
#132RepFlow-50 ([2+1]D CNN, FcF, Non-local block) SOTA
77.9
Acc@1· 2018-10-02
Representation Flow for Action Recognition Code
#133ip-CSN-152
77.8
Acc@1· 2019-04-04
Video Classification with Channel-Separated Convolutional Networks Code
#134I3D + NL SOTA
77.7
Acc@1· 2017-11-21
Non-local Neural Networks Code
#135G-Blend
77.7
Acc@1· 2019-05-29
What Makes Training Multi-Modal Classification Networks Hard? Code
#136HATNet (32 frames)
77.6
Acc@1· 2019-04-25
Large Scale Holistic Video Understanding Code
#137X3D-L
77.5
Acc@1· 2020-04-09
X3D: Expanding Architectures for Efficient Video Recognition Code
#138CoST ResNet-101 (ImageNet pretrain)
77.5
Acc@1
No paper · Code
#139TAda2D (ResNet-50, 16 frames)
77.4
Acc@1· 2021-10-12
TAda! Temporally-Adaptive Convolutions for Video Understanding Code
#140EvaNet
77.4
Acc@1· 2018-11-26
Evolving Space-Time Neural Architectures for Videos
#141RNL+TSM Ensemble(ResNet50, 8 + 16 frames)
77.4
Acc@1· 2020-07-17
Region-based Non-local Operation for Video Classification Code
#142VIMPAC
77.4
Acc@1· 2021-06-21
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning Code
#143BQN (ResNet-50)
77.3
Acc@1· 2021-03-29
Busy-Quiet Video Disentangling for Video Classification Code
#144S3D-G (RGB+Flow, ImageNet pretrained)
77.2
Acc@1· 2017-12-13
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification Code
#145SlowFast 8x8 (ResNet-50)
77
Acc@1· 2018-12-10
SlowFast Networks for Video Recognition Code
#146TAda2D (ResNet-50, 8 frames)
76.7
Acc@1· 2021-10-12
TAda! Temporally-Adaptive Convolutions for Video Understanding Code
#147D3D+S3D-G (RGB + RGB)
76.5
Acc@1· 2018-12-19
D3D: Distilled 3D Networks for Video Action Recognition Code
#148MSNet-R50 (16 frames, ImageNet pretrained)
76.4
Acc@1· 2020-07-20
MotionSqueeze: Neural Motion Feature Learning for Video Understanding Code
#149GloRe
76.1
Acc@1· 2019-06-03
Global Textual Relation Embedding for Relational Understanding Code
#150X3D-M
76
Acc@1· 2020-04-09
X3D: Expanding Architectures for Efficient Video Recognition Code
#151MViT-S
76
Acc@1· 2021-04-22
Multiscale Vision Transformers Code
#152CMA iter1 (16 frames)
75.98
Acc@1· 2019-08-01
Two-Stream Video Classification with Cross-Modality Attention
#153D3D (RGB)
75.9
Acc@1· 2018-12-19
D3D: Distilled 3D Networks for Video Action Recognition Code
#154Oct-I3D + NL
75.7
Acc@1· 2019-04-10
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution Code
#155SlowFast 4x16 (ResNet-50)
75.6
Acc@1· 2018-12-10
SlowFast Networks for Video Recognition Code
#156R[2+1]D-Flow (Sports-1M pretrain)
75.4
Acc@1· 2017-11-30
A Closer Look at Spatiotemporal Convolutions for Action Recognition Code
#157FASTER32
75.1
Acc@1· 2019-06-10
FASTER Recurrent Networks for Efficient Video Classification
#158MoViNet-A2
75
Acc@1· 2021-03-21
MoViNets: Mobile Video Networks for Efficient Video Recognition Code
#159MARS+RGB+Flow (64 frames)
74.9
Acc@1
No paper · Code
#160S3D-G (RGB, ImageNet pretrained)
74.7
Acc@1· 2017-12-13
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification Code
#161TSM
74.7
Acc@1· 2018-11-20
TSM: Temporal Shift Module for Efficient Video Understanding Code
#162A2 Net
74.6
Acc@1· 2018-10-27
$A^2$-Nets: Double Attention Networks
#163R[2+1]D-RGB (Sports-1M pretrain)
74.3
Acc@1· 2017-11-30
A Closer Look at Spatiotemporal Convolutions for Action Recognition Code
#164TSN SOTA
73.9
Acc@1· 2017-08-16
ConvNet Architecture Search for Spatiotemporal Feature Learning Code
#165R[2+1]D-Two-Stream
73.9
Acc@1· 2017-11-30
A Closer Look at Spatiotemporal Convolutions for Action Recognition Code
#166TSN
73.9
Acc@1· 2017-08-16
ConvNet Architecture Search for Spatiotemporal Feature Learning Code
#167STM (ResNet-50)
73.7
Acc@1· 2019-08-07
STM: SpatioTemporal and Motion Encoding for Action Recognition
#168bLVNet Fan et al. (2019)
73.5
Acc@1· 2019-12-02
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation Code
#169Co Slow_64
73.05
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#170Inception-ResNet SOTA
73
Acc@1· 2017-08-12
Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification
#171MFNet
72.8
Acc@1· 2018-07-30
Multi-Fiber Networks for Video Recognition
#172MoViNet-A1
72.7
Acc@1· 2021-03-21
MoViNets: Mobile Video Networks for Efficient Video Recognition Code
#173ARTNet
72.4
Acc@1· 2017-11-24
Appearance-and-Relation Networks for Video Classification Code
#174LGD-3D Flow (ResNet-101)
72.3
Acc@1· 2019-06-13
Learning Spatio-Temporal Representation with Local and Global Diffusion
#175R[2+1]D
72
Acc@1· 2017-11-30
A Closer Look at Spatiotemporal Convolutions for Action Recognition Code
#176R[2+1]D-RGB
72
Acc@1· 2017-11-30
A Closer Look at Spatiotemporal Convolutions for Action Recognition Code
#177FASTER16 w/o sp
71.7
Acc@1· 2019-06-10
FASTER Recurrent Networks for Efficient Video Classification
#178Co X3D-L_64
71.61
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#179I3D SOTA
71.1
Acc@1· 2017-05-22
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset Code
#180Co X3D-M_64
71.03
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#181X3D-L
69.29
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#182MARS+RGB+Flow (16 frames)
68.9
Acc@1
No paper · Code
#183SlowFast-8×8-R50
68.45
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#184S3D-G (Flow, ImageNet pretrained)
68
Acc@1· 2017-12-13
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification Code
#185R[2+1]D-Flow
67.5
Acc@1· 2017-11-30
A Closer Look at Spatiotemporal Convolutions for Action Recognition Code
#186Slow-8x8-R50
67.42
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#187Co X3D-S_64
67.33
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#188X3D-M
67.24
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#189SlowFast-4×16-R50
67.06
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#190Co Slow_8
65.9
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#191MoViNet-A0
65.8
Acc@1· 2021-03-21
MoViNets: Mobile Video Networks for Efficient Video Recognition Code
#192X3D-S
64.71
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#193I3D-R50
63.98
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#194Co X3D-L_16
63.03
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#195Co X3D-M_16
62.8
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#196Co X3D-S_13
60.18
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#197Co I3D_8
59.58
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#198R(2+1)D-18_16
59.52
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#199X3D-XS
59.37
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#200Co I3D_64
56.86
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#201R(2+1)D-18_8
53.52
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code
#202RCU_8
53.4
Acc@1· 2021-05-31
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos Code