Metric: ROUGE-L (higher is better)
| # | Model | ROUGE-L | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | VALOR | 57.4 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 2 | VideoCoCa | 54.5 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 3 | IcoCap (ViT-B/16) | 53.1 | Yes | - | - | - |
| 4 | IcoCap (ViT-B/32) | 52.5 | Yes | - | - | - |
| 5 | CoCap (ViT/L14) | 52 | No | Accurate and Fast Compressed Video Captioning | 2023-09-22 | Code |
| 6 | VASTA (Kinetics-backbone) | 51.88 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 7 | ORG-TRL | 48.9 | Yes | Object Relational Graph with Teacher-Recommended... | 2020-02-26 | - |
| 8 | NITS-VC | 42 | No | NITS-VC System for VATEX Video Captioning Challe... | 2020-06-07 | - |
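
For readers unfamiliar with the metric, the following is a minimal sketch of sentence-level ROUGE-L: an F-measure built on the longest common subsequence (LCS) between a candidate caption and a single reference. It is not the scoring code behind the numbers above. The function names `lcs_length` and `rouge_l`, the whitespace tokenization, the lowercasing, and the default `beta=1.0` are all assumptions made for illustration; evaluation toolkits may tokenize differently, use several references per video, weight recall more heavily through a larger `beta`, and report the score multiplied by 100 as in the table.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score in [0, 1] against one reference.

    Assumptions (illustrative only): lowercase whitespace tokenization and
    beta=1.0 as the default recall weight.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    score = rouge_l("a man is playing a guitar", "a man plays a guitar")
    print(f"ROUGE-L: {100 * score:.1f}")
```

In this example the LCS is "a man a guitar" (4 tokens), so precision is 4/6 and recall is 4/5. Scaling the result by 100 puts it on the same scale as the table.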