Metric: text-to-video R@10 (higher is better)
| Rank | Model | text-to-video R@10 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | OmniVec2 | 70.8 | No | - | - | - |
| 2 | Norton | 64.1 | No | Multi-granularity Correspondence Learning from L... | 2024-01-30 | Code |
| 3 | VideoCLIP | 63.1 | No | VideoCLIP: Contrastive Pre-training for Zero-sho... | 2021-09-28 | Code |
| 4 | TACo | 55.7 | No | TACo: Token-aware Cascade Contrastive Learning f... | 2021-08-23 | - |
| 5 | VAST, HowToCaption-finetuned | 53.9 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code |
| 6 | VideoCoCa | 53.3 | No | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 7 | MIL-NCE | 51.2 | No | End-to-End Learning of Visual Representations fr... | 2019-12-13 | Code |
| 8 | VATT-MBS | 45.5 | No | VATT: Transformers for Multimodal Self-Supervise... | 2021-04-22 | Code |
| 9 | HowToCaption | 44.1 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code |
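
For readers unfamiliar with the metric, R@10 is the percentage of text queries whose correct video appears among the 10 highest-scoring retrieved videos. The sketch below shows one common way to compute it from a query-by-video similarity matrix. It is illustrative only: the function name `recall_at_k`, the assumption that each query has exactly one ground-truth video at the same index, and the tie handling are all assumptions here, and the individual papers listed above may follow different evaluation protocols.

```python
import numpy as np


def recall_at_k(similarity, k=10):
    """Return text-to-video Recall@K as a percentage.

    similarity[i, j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    similarity = np.asarray(similarity, dtype=float)
    # Score assigned to the correct video for each query (the diagonal).
    correct = np.diag(similarity)
    # Zero-based rank: how many videos score strictly higher than the correct one.
    # Ties are therefore resolved in favor of the correct video.
    ranks = (similarity > correct[:, None]).sum(axis=1)
    # A query counts as a hit if its correct video is within the top k.
    return 100.0 * float(np.mean(ranks < k))


if __name__ == "__main__":
    # Toy 3x3 example: query 1 ranks its correct video second.
    sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],
        [0.2, 0.1, 0.7],
    ]
    print(recall_at_k(sim, k=1))
    print(recall_at_k(sim, k=2))
```

In the toy matrix, two of the three queries rank their correct video first, so Recall@1 is two thirds and Recall@2 is 100.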