Metric: METEOR (higher is better)
| # | Model | METEOR | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | VLAB | 51.2 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 2 | VALOR | 51 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 3 | mPLUG-2 | 48.4 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 4 | HowToCaption | 46.4 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code |
| 5 | HiTeA | 45.3 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 6 | Vid2Seq | 45.3 | Yes | Vid2Seq: Large-Scale Pretraining of a Visual Lan... | 2023-02-27 | Code |
| 7 | CoCap (ViT/L14) | 41.4 | No | Accurate and Fast Compressed Video Captioning | 2023-09-22 | Code |
| 8 | VASTA (Vatex-backbone) | 40.65 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 9 | IcoCap (ViT-B/16) | 39.5 | Yes | - | - | - |
| 10 | VASTA (Kinetics-backbone) | 39.1 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 11 | IcoCap (ViT-B/32) | 38.9 | Yes | - | - | - |
| 12 | SEM-POS | 38.5 | No | SEM-POS: Grammatically and Semantically Correct ... | 2023-03-26 | - |