Video Captioning on YouCook2
Metric: BLEU-4 (higher is better)
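For readers unfamiliar with the metric, here is a minimal sketch of how a BLEU-4 score can be computed: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty and scaled to 0-100. The function names (`ngrams`, `corpus_bleu4`), the whitespace tokenization, the single reference per candidate, and the lack of smoothing are all assumptions made for illustration. The leaderboard does not state which BLEU implementation, tokenization, or evaluation granularity each paper used, so scores from this sketch are not directly comparable to the numbers listed here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(candidates, references):
    """Corpus-level BLEU-4 on a 0-100 scale.

    Illustrative assumptions: one reference per candidate, pre-tokenized
    inputs (lists of tokens), uniform n-gram weights, no smoothing.
    """
    matched = [0] * 4  # clipped n-gram matches for n = 1..4
    total = [0] * 4    # candidate n-gram counts for n = 1..4
    cand_len = 0
    ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, 5):
            cand_counts = ngrams(cand, n)
            ref_counts = ngrams(ref, n)
            # Clip each candidate n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    if min(matched) == 0:
        return 0.0  # unsmoothed: any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / 4
    # Penalize candidates that are shorter than their references overall.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Hypothetical cooking-style caption pair, not taken from YouCook2.
reference = "add the chopped onions to the pan".split()
candidate = "add the onions to the pan".split()
score = corpus_bleu4([candidate], [reference])
print(f"BLEU-4: {score:.2f}")
```

A candidate identical to its reference scores 100, and a candidate with no matching 4-gram scores 0 under this unsmoothed variant, which is why higher is better.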
Results
| # | Model | BLEU-4 | Extra Data | SOTA tag | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | VAST | 18.2 | Yes | SOTA | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023-05-29 | Code |
| 2 | UniVL + MELTR | 17.92 | No | SOTA | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | 2023-03-23 | Code |
| 3 | UniVL | 17.35 | Yes | SOTA | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | 2020-02-15 | Code |
| 4 | VideoCoCa | 14.2 | Yes | - | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 2022-12-09 | - |
| 5 | VLM | 12.27 | Yes | - | VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | 2021-05-20 | Code |
| 6 | E2vidD6-MASSvid-BiD | 12.04 | Yes | - | Multimodal Pretraining for Dense Video Captioning | 2020-11-10 | Code |
| 7 | TextKG | 11.7 | No | - | Text with Knowledge Graph Augmented Transformer for Video Captioning | 2023-03-22 | - |
| 8 | COOT | 11.3 | Yes | - | COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | 2020-11-01 | Code |
| 9 | COSA | 10.1 | Yes | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | 2023-06-15 | Code |
| 10 | HowToCaption | 8.8 | No | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | 2023-10-07 | Code |
| 11 | OmniVL | 8.72 | No | - | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | 2022-09-15 | - |
| 12 | Zhou | 4.38 | No | SOTA | End-to-End Dense Video Captioning with Masked Transformer | 2018-04-03 | Code |
| 13 | VideoBERT + S3D | 4.33 | No | - | VideoBERT: A Joint Model for Video and Language Representation Learning | 2019-04-03 | Code |