Action Recognition on AVA v2.2

Metric: mAP (higher is better)

Results

#	Model	mAP	Extra Data	SOTA	Paper	Date	Code
1	LART (Hiera-H, K700 PT+FT)	45.1	Yes	SOTA	On the Benefits of 3D Pose and Tracking for Human Action Recognition	2023-04-03	Code
2	Hiera-H (K700 PT+FT)	43.3	Yes	-	Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles	2023-06-01	Code
3	VideoMAE V2-g	42.6	Yes	SOTA	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	2023-03-29	Code
4	STAR/L	41.7	Yes	-	End-to-End Spatio-Temporal Action Localisation with Video Transformers	2023-04-24	-
5	MVD (Kinetics400 pretrain+finetune, ViT-H, 16x4)	41.1	Yes	SOTA	Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning	2022-12-08	Code
6	InternVideo	41.01	Yes	SOTA	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022-12-06	Code
7	MVD (Kinetics400 pretrain, ViT-H, 16x4)	40.1	Yes	-	Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning	2022-12-08	Code
8	MaskFeat (Kinetics-600 pretrain, MViT-L)	39.8	Yes	SOTA	Masked Feature Prediction for Self-Supervised Visual Pre-Training	2021-12-16	Code
9	UMT-L (ViT-L/16)	39.8	Yes	-	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023-03-28	Code
10	VideoMAE (K400 pretrain+finetune, ViT-H, 16x4)	39.5	Yes	-	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
11	VideoMAE (K700 pretrain+finetune, ViT-L, 16x4)	39.3	Yes	-	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
12	MVD (Kinetics400 pretrain+finetune, ViT-L, 16x4)	38.7	Yes	-	Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning	2022-12-08	Code
13	VideoMAE (K400 pretrain+finetune, ViT-L, 16x4)	37.8	Yes	-	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
14	MVD (Kinetics400 pretrain, ViT-L, 16x4)	37.7	Yes	-	Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning	2022-12-08	Code
15	VideoMAE (K400 pretrain, ViT-H, 16x4)	36.5	Yes	-	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
16	VideoMAE (K700 pretrain, ViT-L, 16x4)	36.1	Yes	-	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
17	MeMViT-24	35.4	Yes	-	MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition	2022-01-20	Code
18	MViTv2-L (IN21k, K700)	34.4	Yes	SOTA	MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	2021-12-02	Code
19	VideoMAE (K400 pretrain, ViT-L, 16x4)	34.3	Yes	-	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
20	MVD (Kinetics400 pretrain+finetune, ViT-B, 16x4)	34.2	Yes	-	Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning	2022-12-08	Code
21	AMD (ViT-B/16)	33.5	Yes	-	Asymmetric Masked Distillation for Pre-Training Small Foundation Models	2023-11-06	-
22	HIT	32.6	No	-	Holistic Interaction Transformer Network for Action Detection	2022-10-23	Code
23	VideoMAE (K400 pretrain+finetune, ViT-B, 16x4)	31.8	Yes	-	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
24	ACAR-Net, SlowFast R-101 (Kinetics-700 pretraining)	31.72	Yes	SOTA	Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization	2020-06-14	Code
25	MVD (Kinetics400 pretrain, ViT-B, 16x4)	31.1	Yes	-	Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning	2022-12-08	Code
26	Object Transformer	31	No	-	Towards Long-Form Video Understanding	2021-06-21	Code
27	MViT-B-24, 32x3 (Kinetics-600 pretraining)	28.7	No	-	Multiscale Vision Transformers	2021-04-22	Code
28	MViT-B, 32x3 (Kinetics-600 pretraining)	27.5	No	-	Multiscale Vision Transformers	2021-04-22	Code
29	SlowFast, 16x8 R101+NL (Kinetics-600 pretraining)	27.5	No	SOTA	SlowFast Networks for Video Recognition	2018-12-10	Code
30	MViT-B, 64x3 (Kinetics-400 pretraining)	27.3	No	-	Multiscale Vision Transformers	2021-04-22	Code
31	SlowFast, 8x8 R101+NL (Kinetics-600 pretraining)	27.1	No	-	SlowFast Networks for Video Recognition	2018-12-10	Code
32	MViT-B, 32x3 (Kinetics-400 pretraining)	26.8	No	-	Multiscale Vision Transformers	2021-04-22	Code
33	VideoMAE (K400 pretrain, ViT-B, 16x4)	26.7	Yes	-	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training	2022-03-23	Code
34	ORViT MViT-B, 16x4 (K400 pretraining)	26.6	No	-	Object-Region Video Transformers	2021-10-13	Code
35	MViT-B, 16x4 (Kinetics-600 pretraining)	26.1	No	-	Multiscale Vision Transformers	2021-04-22	Code
36	MViT-B, 16x4 (Kinetics-400 pretraining)	24.5	No	-	Multiscale Vision Transformers	2021-04-22	Code
37	SlowFast, 8x8, R101 (Kinetics-400 pretraining)	23.8	No	-	SlowFast Networks for Video Recognition	2018-12-10	Code
38	SlowFast, 4x16, R50 (Kinetics-400 pretraining)	21.9	No	-	SlowFast Networks for Video Recognition	2018-12-10	Code