Action Recognition on EPIC-KITCHENS-100

Metric: Action@1 (higher is better)
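
The page does not define the metric. As a minimal sketch, assume that an action in EPIC-KITCHENS-100 is a (verb, noun) pair and that Action@1 is the percentage of clips whose top-1 predicted pair matches the ground truth on both parts. The function name `action_top1` and the integer class IDs are hypothetical, and the benchmark's official evaluation script may differ in details such as how verb and noun scores are combined.

```python
from typing import Sequence, Tuple

Action = Tuple[int, int]  # (verb_class, noun_class); the IDs here are illustrative


def action_top1(pred: Sequence[Action], gold: Sequence[Action]) -> float:
    """Percentage of clips whose predicted (verb, noun) pair matches exactly."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    if not gold:
        raise ValueError("need at least one clip")
    # A clip counts as correct only when both the verb and the noun are right.
    correct = sum(p == g for p, g in zip(pred, gold))
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    pred = [(0, 1), (2, 3), (4, 5), (1, 1)]
    gold = [(0, 1), (2, 9), (4, 5), (7, 1)]
    # Clip 2 has the wrong noun and clip 4 the wrong verb, so 2 of 4 are correct.
    print(action_top1(pred, gold))
```

Under this reading, a model can score well on verbs and nouns separately and still have a lower Action@1, because a single wrong component makes the whole action wrong.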

Results

#	Model	Action@1	Extra Data	Paper	Date	Code
1	LLaVAction	58.3	Yes	LLaVAction: evaluating and training multi-modal large language models for action recognition	2025-03-24	Code
2	TIM	56.4	Yes	TIM: A Time Interval Machine for Audio-Visual Action Recognition	2024-04-08	Code
3	Avion (ViT-L)	54.4	Yes	Training a Large Video Model on a Single Machine in a Day	2023-09-28	Code
4	M&M (WTS 60M)	53.6	Yes	M&M Mix: A Multimodal Multiview Transformer Ensemble	2022-06-20	-
5	LVMAE	52.1	Yes	Extending Video Masked Autoencoders to 128 frames	2024-11-20	-
6	TAdaFormer-L/14	51.8	Yes	Temporally-Adaptive Models for Efficient Video Understanding	2023-08-10	Code
7	LaViLa (TimeSformer-L)	51	Yes	Learning Video Representations from Large Language Models	2022-12-08	Code
8	MTV-B (WTS 60M)	50.5	Yes	Multiview Transformers for Video Recognition	2022-01-12	Code
9	OMNIVORE (Swin-B, finetuned)	49.9	Yes	Omnivore: A Single Model for Many Visual Modalities	2022-01-20	Code
10	CAST (ViT-B/16)	49.3	No	CAST: Cross-Attention in Space and Time for Video Action Recognition	2023-11-30	Code
11	TAdaConvNeXtV2-S	48.9	Yes	Temporally-Adaptive Models for Efficient Video Understanding	2023-08-10	Code
12	MeMViT-24	48.4	Yes	MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition	2022-01-20	Code
13	MMT	47.8	No	-	-	-
14	MoViNet-A6	47.7	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
15	AVT	47.2	No	-	-	-
16	ORViT Mformer-L (ORViT blocks)	45.7	No	Object-Region Video Transformers	2021-10-13	Code
17	TempAgg	45.26	No	Technical Report: Temporal Aggregate Representations	2021-06-06	Code
18	MoViNet-A5	44.5	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
19	Mformer-HR	44.5	Yes	Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers	2021-06-09	Code
20	GSF	44.48	Yes	Gate-Shift-Fuse for Video Action Recognition	2022-03-16	Code
21	MoViNet-A4	44.4	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
22	Mformer-L	44.1	Yes	Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers	2021-06-09	Code
23	ViViT-L/16x2 Fact. encoder	44	No	ViViT: A Video Vision Transformer	2021-03-29	Code
24	MBT	43.4	No	Attention Bottlenecks for Multimodal Fusion	2021-06-30	Code
25	Mformer	43.1	Yes	Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers	2021-06-09	Code
26	MoViNet-A2	41.2	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
27	TSM	37.39	No	Rescaling Egocentric Vision	2020-06-23	Code
28	SlowFast	36.81	No	Rescaling Egocentric Vision	2020-06-23	Code
29	MoViNet-A0	36.8	No	MoViNets: Mobile Video Networks for Efficient Video Recognition	2021-03-21	Code
30	TBN	35.55	No	Rescaling Egocentric Vision	2020-06-23	Code
31	TRN	35.28	No	Rescaling Egocentric Vision	2020-06-23	Code
32	TSN	33.57	No	Rescaling Egocentric Vision	2020-06-23	Code

Entries flagged SOTA on the leaderboard: LLaVAction, TIM, Avion (ViT-L), M&M (WTS 60M), MTV-B (WTS 60M), MoViNet-A6, TSM.
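
The results can also be read as a timeline rather than a ranking. The sketch below walks the dated entries in publication order and keeps each one that beats every earlier-dated entry. The names `ROWS` and `sota_timeline` are hypothetical, the data is copied by hand from the results above, and the two undated entries (MMT, AVT) are left out because they cannot be placed in time. Treating "best score among entries published so far" as the meaning of state of the art is an assumption, not something the page states.

```python
from itertools import groupby

# (model, Action@1, publication date), copied from the results above.
ROWS = [
    ("LLaVAction", 58.3, "2025-03-24"),
    ("TIM", 56.4, "2024-04-08"),
    ("Avion (ViT-L)", 54.4, "2023-09-28"),
    ("M&M (WTS 60M)", 53.6, "2022-06-20"),
    ("LVMAE", 52.1, "2024-11-20"),
    ("TAdaFormer-L/14", 51.8, "2023-08-10"),
    ("LaViLa (TimeSformer-L)", 51.0, "2022-12-08"),
    ("MTV-B (WTS 60M)", 50.5, "2022-01-12"),
    ("OMNIVORE (Swin-B, finetuned)", 49.9, "2022-01-20"),
    ("CAST (ViT-B/16)", 49.3, "2023-11-30"),
    ("TAdaConvNeXtV2-S", 48.9, "2023-08-10"),
    ("MeMViT-24", 48.4, "2022-01-20"),
    ("MoViNet-A6", 47.7, "2021-03-21"),
    ("ORViT Mformer-L (ORViT blocks)", 45.7, "2021-10-13"),
    ("TempAgg", 45.26, "2021-06-06"),
    ("MoViNet-A5", 44.5, "2021-03-21"),
    ("Mformer-HR", 44.5, "2021-06-09"),
    ("GSF", 44.48, "2022-03-16"),
    ("MoViNet-A4", 44.4, "2021-03-21"),
    ("Mformer-L", 44.1, "2021-06-09"),
    ("ViViT-L/16x2 Fact. encoder", 44.0, "2021-03-29"),
    ("MBT", 43.4, "2021-06-30"),
    ("Mformer", 43.1, "2021-06-09"),
    ("MoViNet-A2", 41.2, "2021-03-21"),
    ("TSM", 37.39, "2020-06-23"),
    ("SlowFast", 36.81, "2020-06-23"),
    ("MoViNet-A0", 36.8, "2021-03-21"),
    ("TBN", 35.55, "2020-06-23"),
    ("TRN", 35.28, "2020-06-23"),
    ("TSN", 33.57, "2020-06-23"),
]


def sota_timeline(rows):
    """Return the entries that beat every earlier-dated entry, in date order."""
    by_date = sorted(rows, key=lambda r: r[2])  # ISO dates sort correctly as strings
    best = float("-inf")
    timeline = []
    # Entries sharing a date compete with each other first; only the best may advance.
    for _, same_day in groupby(by_date, key=lambda r: r[2]):
        top = max(same_day, key=lambda r: r[1])
        if top[1] > best:
            best = top[1]
            timeline.append(top)
    return timeline


if __name__ == "__main__":
    for model, score, date in sota_timeline(ROWS):
        print(f"{date}  {score:>5}  {model}")
```

On these rows the timeline runs TSM, MoViNet-A6, MTV-B (WTS 60M), M&M (WTS 60M), Avion (ViT-L), TIM, LLaVAction, which are the same seven entries the leaderboard flags as SOTA.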