Metric: BLEU-4 (higher is better)
| # | Model | BLEU-4 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | VALOR | 45.6 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 2 | VAST | 45 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 3 | COSA | 43.7 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 4 | VideoCoCa | 39.7 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 5 | IcoCap (ViT-B/16) | 37.4 | Yes | - | - | - |
| 6 | IcoCap (ViT-B/32) | 36.9 | Yes | - | - | - |
| 7 | VASTA (Kinetics-backbone) | 36.25 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 8 | CoCap (ViT/L14) | 35.8 | No | Accurate and Fast Compressed Video Captioning | 2023-09-22 | Code |
| 9 | ORG-TRL | 32.1 | Yes | Object Relational Graph with Teacher-Recommended... | 2020-02-26 | - |
| 10 | NITS-VC | 20 | No | NITS-VC System for VATEX Video Captioning Challe... | 2020-06-07 | - |