Metric: BLEU-4 (higher is better)
| # | Model | BLEU-4 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | VALOR | 80.7 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 2 | VLAB | 79.3 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 3 | COSA | 76.5 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 4 | HiTeA | 71.0 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 5 | mPLUG-2 | 70.5 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 6 | HowToCaption | 70.4 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code |
| 7 | RTQ | 66.9 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 8 | CoCap (ViT/L14) | 60.1 | No | Accurate and Fast Compressed Video Captioning | 2023-09-22 | Code |
| 9 | SEM-POS | 60.1 | No | SEM-POS: Grammatically and Semantically Correct ... | 2023-03-26 | - |
| 10 | VASTA (Vatex-backbone) | 59.2 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 11 | IcoCap (ViT-B/16) | 59.1 | Yes | - | - | - |
| 12 | IcoCap (ViT-B/32) | 56.3 | Yes | - | - | - |
| 13 | VASTA (Kinetics-backbone) | 56.1 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
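
For readers unfamiliar with the metric, the following is a minimal sketch of how a BLEU-4 score on the 0-100 scale used above can be computed for a single tokenized caption. It assumes the standard unsmoothed definition (geometric mean of clipped 1- to 4-gram precisions times a brevity penalty). The function names `ngrams` and `bleu4` are illustrative, and the leaderboard does not state which implementation, tokenization, or corpus-level aggregation each entry used, so scores from this sketch are not directly comparable to the table.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Unsmoothed BLEU-4 on a 0-100 scale for one tokenized candidate.

    Illustrative sentence-level sketch; published results are usually
    computed at corpus level by pooling n-gram counts over all captions.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return 100 * bp * math.exp(sum(log_precisions) / 4)


reference = "a man is playing a guitar".split()
print(bleu4("a man is playing a guitar".split(), [reference]))
print(bleu4("a man is playing".split(), [reference]))
```

The second call shows the brevity penalty at work: every n-gram of the shorter caption appears in the reference, yet the score drops below 100 because the caption is shorter than the reference.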