Metric: text-to-video Median Rank (lower is better; rows are listed in descending order of the metric)
| # | Model | text-to-video Median Rank | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | HowToCaption | 2 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code |
| 2 | MILES | 2 | No | MILES: Visual BERT Pre-training with Injected La... | 2022-04-26 | Code |
| 3 | Y. Ge et al. | 2 | No | Bridging Video-text Retrieval with Multiple Choi... | 2022-01-13 | Code |
| 4 | CLIP4Clip | 2 | No | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code |
| 5 | LaT | 2 | No | LaT: Latent Translation with Cycle-Consistency f... | 2022-07-11 | - |
| 6 | VAST, HowToCaption-finetuned | 1 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code |
| 7 | LanguageBind(ViT-L/14) | 1 | Yes | LanguageBind: Extending Video-Language Pretraini... | 2023-10-03 | Code |
| 8 | LanguageBind(ViT-H/14) | 1 | Yes | LanguageBind: Extending Video-Language Pretraini... | 2023-10-03 | Code |
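
For readers unfamiliar with the metric, the following is a minimal sketch of how a text-to-video median rank is commonly computed. It is an illustration under stated assumptions, not the evaluation code of any model listed above: the function name `median_rank`, the toy scores, and the convention that the correct video for query `i` sits at index `i` are all hypothetical.

```python
from statistics import median

def median_rank(similarity):
    """Median 1-based rank of the correct video across all text queries.

    Assumption: similarity[i][j] scores text query i against video j,
    and the correct video for query i is at index i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        correct = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(score > correct for score in row))
    return median(ranks)

# Toy scores for three queries and three videos (made-up values).
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.2, 0.4, 0.8],  # correct video ranked 2nd
    [0.1, 0.2, 0.7],  # correct video ranked 1st
]
print(median_rank(sim))  # median of [1, 2, 1]
```

A value of 1 means the correct video is the top result for at least half of the queries, which is why smaller values indicate better retrieval.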