Video Captioning on YouCook2
Metric: CIDEr (higher is better)
Results
| # | Model | CIDEr | Extra Data | Paper | Date | Code | SOTA |
|---|---|---|---|---|---|---|---|
| 1 | HowToCaption | 116.4 | No | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | 2023-10-07 | Yes | SOTA |
| 2 | HiCM² | 71.84 | Yes | HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning | 2024-12-19 | Yes | - |
| 3 | Vid2Seq (HowTo100M+VidChapters-7M PT) | 67.2 | Yes | No paper | - | No | - |
| 4 | Vid2Seq | 47.1 | Yes | Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | 2023-02-27 | Yes | SOTA |
| 5 | CM² | 31.66 | No | Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval | 2024-04-11 | Yes | - |
| 6 | GVL | 26.52 | No | Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos | 2023-03-11 | Yes | - |
| 7 | PDVC (TSN features, no SCST) | 22.71 | No | End-to-End Dense Video Captioning with Parallel Decoding | 2021-08-17 | Yes | SOTA |
| 8 | Vid2Seq (HowTo100M+VidChapters-7M PT) | 13.3 | Yes | No paper | - | No | - |
| 9 | VAST | 1.99 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023-05-29 | Yes | - |
| 10 | UniVL + MELTR | 1.9 | No | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | 2023-03-23 | Yes | - |
| 11 | UniVL | 1.81 | Yes | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | 2020-02-15 | Yes | SOTA |
| 12 | VLM | 1.3869 | Yes | VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | 2021-05-20 | Yes | - |
| 13 | TextKG | 1.33 | No | Text with Knowledge Graph Augmented Transformer for Video Captioning | 2023-03-22 | No | - |
| 14 | COSA | 1.31 | Yes | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | 2023-06-15 | Yes | - |
| 15 | MA-LMM | 1.31 | No | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | 2024-04-08 | Yes | - |
| 16 | VideoCoCa | 1.28 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 2022-12-09 | No | - |
| 17 | E2vidD6-MASSvid-BiD | 1.22 | Yes | Multimodal Pretraining for Dense Video Captioning | 2020-11-10 | Yes | - |
| 18 | OmniVL | 1.16 | No | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | 2022-09-15 | No | - |
| 19 | COOT | 0.57 | Yes | COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | 2020-11-01 | Yes | - |
| 20 | VideoBERT + S3D | 0.55 | No | VideoBERT: A Joint Model for Video and Language Representation Learning | 2019-04-03 | Yes | SOTA |
| 21 | Zhou | 0.38 | No | End-to-End Dense Video Captioning with Masked Transformer | 2018-04-03 | Yes | SOTA |