Action Recognition on AVA v2.2
Metric: mAP (higher is better)
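As a rough illustration of the metric, the sketch below computes a mean average precision as the unweighted mean of per-class average precision, where each class's AP is taken over detections ranked by confidence. The function names, the toy inputs, and the non-interpolated AP definition are assumptions made for illustration. The official AVA evaluation protocol (how boxes are matched to ground truth, whether precision is interpolated) is not described on this page and may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical input layout: for each class, a list of
# (confidence, is_true_positive) detections plus the number of
# ground-truth instances of that class.
ClassResults = Tuple[List[Tuple[float, bool]], int]


def average_precision(scored_hits: List[Tuple[float, bool]], num_positives: int) -> float:
    """Non-interpolated average precision for a single class."""
    if num_positives == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        if is_hit:
            true_positives += 1
            # Accumulate precision at the rank of each correct detection.
            precision_sum += true_positives / rank
    return precision_sum / num_positives


def mean_average_precision(per_class: Dict[str, ClassResults]) -> float:
    """Unweighted mean of per-class AP (0.0 if there are no classes)."""
    aps = [average_precision(hits, n) for hits, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Toy data, not taken from any leaderboard entry.
toy = {
    "stand": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "sit": ([(0.6, False), (0.5, True)], 1),
}

if __name__ == "__main__":
    # Leaderboard scores are percentages, so scale by 100.
    print(round(100 * mean_average_precision(toy), 1))
```

With the toy data, the two per-class APs are 5/6 and 1/2, so the mean is 2/3, which would be reported as 66.7 on the percentage scale this leaderboard uses.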
Results
| # | Model | mAP | SOTA | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | LART (Hiera-H, K700 PT+FT) | 45.1 | SOTA | Yes | On the Benefits of 3D Pose and Tracking for Human Action Recognition | 2023-04-03 | Code |
| 2 | Hiera-H (K700 PT+FT) | 43.3 | - | Yes | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | 2023-06-01 | Code |
| 3 | VideoMAE V2-g | 42.6 | SOTA | Yes | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | 2023-03-29 | Code |
| 4 | STAR/L | 41.7 | - | Yes | End-to-End Spatio-Temporal Action Localisation with Video Transformers | 2023-04-24 | - |
| 5 | MVD (Kinetics400 pretrain+finetune, ViT-H, 16x4) | 41.1 | SOTA | Yes | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | 2022-12-08 | Code |
| 6 | InternVideo | 41.01 | SOTA | Yes | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 2022-12-06 | Code |
| 7 | MVD (Kinetics400 pretrain, ViT-H, 16x4) | 40.1 | - | Yes | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | 2022-12-08 | Code |
| 8 | MaskFeat (Kinetics-600 pretrain, MViT-L) | 39.8 | SOTA | Yes | Masked Feature Prediction for Self-Supervised Visual Pre-Training | 2021-12-16 | Code |
| 9 | UMT-L (ViT-L/16) | 39.8 | - | Yes | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | 2023-03-28 | Code |
| 10 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | 39.5 | - | Yes | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | 2022-03-23 | Code |
| 11 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | 39.3 | - | Yes | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | 2022-03-23 | Code |
| 12 | MVD (Kinetics400 pretrain+finetune, ViT-L, 16x4) | 38.7 | - | Yes | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | 2022-12-08 | Code |
| 13 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | 37.8 | - | Yes | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | 2022-03-23 | Code |
| 14 | MVD (Kinetics400 pretrain, ViT-L, 16x4) | 37.7 | - | Yes | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | 2022-12-08 | Code |
| 15 | VideoMAE (K400 pretrain, ViT-H, 16x4) | 36.5 | - | Yes | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | 2022-03-23 | Code |
| 16 | VideoMAE (K700 pretrain, ViT-L, 16x4) | 36.1 | - | Yes | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | 2022-03-23 | Code |
| 17 | MeMViT-24 | 35.4 | - | Yes | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | 2022-01-20 | Code |
| 18 | MViTv2-L (IN21k, K700) | 34.4 | SOTA | Yes | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | 2021-12-02 | Code |
| 19 | VideoMAE (K400 pretrain, ViT-L, 16x4) | 34.3 | - | Yes | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | 2022-03-23 | Code |
| 20 | MVD (Kinetics400 pretrain+finetune, ViT-B, 16x4) | 34.2 | - | Yes | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | 2022-12-08 | Code |
| 21 | AMD(ViT-B/16) | 33.5 | - | Yes | Asymmetric Masked Distillation for Pre-Training Small Foundation Models | 2023-11-06 | - |
| 22 | HIT | 32.6 | - | No | Holistic Interaction Transformer Network for Action Detection | 2022-10-23 | Code |
| 23 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) | 31.8 | - | Yes | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | 2022-03-23 | Code |
| 24 | ACAR-Net, SlowFast R-101 (Kinetics-700 pretraining) | 31.72 | SOTA | Yes | Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization | 2020-06-14 | Code |
| 25 | MVD (Kinetics400 pretrain, ViT-B, 16x4) | 31.1 | - | Yes | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | 2022-12-08 | Code |
| 26 | Object Transformer | 31 | - | No | Towards Long-Form Video Understanding | 2021-06-21 | Code |
| 27 | MViT-B-24, 32x3 (Kinetics-600 pretraining) | 28.7 | - | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 28 | MViT-B, 32x3 (Kinetics-500 pretraining) | 27.5 | - | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 29 | SlowFast, 16x8 R101+NL (Kinetics-600 pretraining) | 27.5 | SOTA | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 30 | MViT-B, 64x3 (Kinetics-400 pretraining) | 27.3 | - | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 31 | SlowFast, 8x8 R101+NL (Kinetics-600 pretraining) | 27.1 | - | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 32 | MViT-B, 32x3 (Kinetics-400 pretraining) | 26.8 | - | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 33 | VideoMAE (K400 pretrain, ViT-B, 16x4) | 26.7 | - | Yes | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | 2022-03-23 | Code |
| 34 | ORViT MViT-B, 16x4 (K400 pretraining) | 26.6 | - | No | Object-Region Video Transformers | 2021-10-13 | Code |
| 35 | MViT-B, 16x4 (Kinetics-600 pretraining) | 26.1 | - | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 36 | MViT-B, 16x4 (Kinetics-400 pretraining) | 24.5 | - | No | Multiscale Vision Transformers | 2021-04-22 | Code |
| 37 | SlowFast, 8x8, R101 (Kinetics-400 pretraining) | 23.8 | - | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |
| 38 | SlowFast, 4x16, R50 (Kinetics-400 pretraining) | 21.9 | - | No | SlowFast Networks for Video Recognition | 2018-12-10 | Code |