Video Captioning on YouCook2

Metric: METEOR (higher is better)
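
For readers unfamiliar with the metric, the following is a minimal sketch of a METEOR-style score, assuming exact word matching only and the original formulation (recall-weighted harmonic mean with a fragmentation penalty). The function name `simple_meteor` and the greedy left-to-right alignment are illustrative assumptions; full METEOR implementations also match stems, synonyms, and paraphrases and use tuned parameters, so this sketch will not reproduce the scores listed here. The leaderboard values appear to be on a 0-100 scale, whereas the sketch returns a value between 0 and 1.

```python
def simple_meteor(hypothesis: str, reference: str) -> float:
    """Simplified METEOR: exact unigram matches only (an assumption of this sketch)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0

    # Greedy alignment: each hypothesis word takes the first unused
    # identical reference word. Real METEOR searches for the alignment
    # with the fewest chunks; this is a simplification.
    used = [False] * len(ref)
    alignment = []  # list of (hypothesis_index, reference_index)
    for i, word in enumerate(hyp):
        for j, ref_word in enumerate(ref):
            if not used[j] and word == ref_word:
                used[j] = True
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Harmonic mean that weights recall nine times more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)

    # A chunk is a run of matches that is contiguous in both sentences.
    chunks = 1
    for (i_prev, j_prev), (i_cur, j_cur) in zip(alignment, alignment[1:]):
        if i_cur != i_prev + 1 or j_cur != j_prev + 1:
            chunks += 1

    # More chunks means more fragmented word order and a larger penalty.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)


if __name__ == "__main__":
    reference = "add the chopped onions"
    print(simple_meteor("add the chopped onions", reference))
    print(simple_meteor("add onions", reference))
    print(simple_meteor("boil water", reference))
```

A caption that covers more of the reference words in the same order scores higher, which is why higher is better.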

Results

#	Model	METEOR	Extra Data	SOTA Flag	Paper	Date	Code
1	UniVL + MELTR	22.56	No	SOTA	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
2	UniVL	22.35	Yes	SOTA	UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation	2020-02-15	Code
3	COOT	19.85	Yes	-	COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning	2020-11-01	Code
4	E2vidD6-MASSvid-BiD	18.32	Yes	-	Multimodal Pretraining for Dense Video Captioning	2020-11-10	Code
5	VLM	18.22	Yes	-	VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding	2021-05-20	Code
6	MA-LMM	17.6	No	-	MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	2024-04-08	Code
7	HowToCaption	15.9	No	-	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	2023-10-07	Code
8	OmniVL	14.83	No	-	OmniVL: One Foundation Model for Image-Language and Video-Language Tasks	2022-09-15	-
9	TextKG	14.8	No	-	Text with Knowledge Graph Augmented Transformer for Video Captioning	2023-03-22	-
10	HiCM²	12.8	Yes	-	HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning	2024-12-19	Code
11	Vid2Seq (HowTo100M+VidChapters-7M PT)	12.3	Yes	-	-	-	-
12	VideoBERT + S3D	11.94	No	SOTA	VideoBERT: A Joint Model for Video and Language Representation Learning	2019-04-03	Code
13	Zhou	11.55	No	SOTA	End-to-End Dense Video Captioning with Masked Transformer	2018-04-03	Code
14	Vid2Seq	9.3	Yes	-	Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning	2023-02-27	Code
15	CM²	6.08	No	-	Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval	2024-04-11	Code
16	GVL	5.01	No	-	Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos	2023-03-11	Code
17	PDVC (TSN features, no SCST)	4.74	No	-	End-to-End Dense Video Captioning with Parallel Decoding	2021-08-17	Code
18	Vid2Seq (HowTo100M+VidChapters-7M PT)	3.4	Yes	-	-	-	-