Metric: text-to-video Median Rank (lower is better; the rows below are sorted by descending value)
| # | Model | text-to-video Median Rank | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Satar et al. | 77 | No | Semantic Role Aware Correlation Transformer for ... | 2022-06-26 | Code |
| 2 | HGLMM FV CCA | 75 | No | - | - | - |
| 3 | RoME | 53 | No | RoME: Role-aware Mixture-of-Expert Transformer f... | 2022-06-26 | Code |
| 4 | Text-Video Embedding | 24 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 5 | COOT | 9 | No | COOT: Cooperative Hierarchical Transformer for V... | 2020-11-01 | Code |
| 6 | TACo | 4 | Yes | TACo: Token-aware Cascade Contrastive Learning f... | 2021-08-23 | - |
| 7 | UniVL | 4 | Yes | UniVL: A Unified Video and Language Pre-Training... | 2020-02-15 | Code |
| 8 | VLM | 4 | Yes | VLM: Task-agnostic Video-Language Model Pre-trai... | 2021-05-20 | Code |
| 9 | UniVL + MELTR | 3 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 10 | MDMMT-2 | 3 | Yes | MDMMT-2: Multidomain Multimodal Transformer for ... | 2022-03-14 | - |
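
For readers unfamiliar with the metric, the following is a minimal sketch of how a text-to-video median rank is commonly computed. It is an illustration only, not the evaluation code of any model in the table. The function name `median_rank`, the toy similarity matrix, and the convention that the correct video for query `i` sits at index `i` are all assumptions made for this example.

```python
from statistics import median


def median_rank(similarity):
    """Median rank of the correct video over all text queries.

    similarity[i][j] is the score of text query i against video j.
    Assumption for this sketch: the correct video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank is 1 plus the number of videos scored strictly higher
        # than the correct one, so rank 1 means it was retrieved first.
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)
    return median(ranks)


# Hypothetical toy scores: 3 text queries against 3 videos.
toy = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.2, 0.5],  # correct video ranked 3rd
    [0.1, 0.2, 0.7],  # correct video ranked 1st
]
print(median_rank(toy))
```

In this toy case the per-query ranks are 1, 3, and 1, so the median is 1. A rank of 1 means the correct video was the top result for that query.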