Video Captioning on ActivityNet Captions
Metric: CIDEr (higher is better)
Results
| # | Model | CIDEr | SOTA | Extra Data | Paper | Date | Code |
|---|-------|-------|------|------------|-------|------|------|
| 1 | VideoCoCa | 39.3 | Yes | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 2022-12-09 | - |
| 2 | GVL | 33.33 | No | No | Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos | 2023-03-11 | Code |
| 3 | CM² | 33.01 | No | No | Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval | 2024-04-11 | Code |
| 4 | VLCap (ae-test split) - Appearance + Language | 31.29 | Yes | No | VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning | 2022-06-26 | Code |
| 5 | PDVC (TSP features, no SCST) | 31.14 | Yes | No | End-to-End Dense Video Captioning with Parallel Decoding | 2021-08-17 | Code |
| 6 | VLTinT (ae-test split) C3D/Ling | 31.13 | No | No | VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning | 2022-11-28 | Code |
| 7 | COOT (ae-test split) - Only Appearance features | 28.19 | Yes | No | COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | 2020-11-01 | Code |
| 8 | Vid2Seq | 28 | No | Yes | Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | 2023-02-27 | Code |
| 9 | VTimeLLM | 27.6 | No | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | Code |
| 10 | MART (ae-test split) - Appearance + Flow | 23.42 | Yes | No | MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning | 2020-05-11 | Code |
| 11 | ADV-INF + Global | 19.4 | No | No | No paper | - | Code |