Metric: text-to-video Mean Rank (lower is better; rows below are listed from highest to lowest value)
| # | Model | text-to-video Mean Rank | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | CLIP4Clip | 58 | Yes | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code |
| 2 | MDMMT | 58 | Yes | MDMMT: Multidomain Multimodal Transformer for Vi... | 2021-03-19 | Code |
| 3 | HunYuan_tvr | 56.4 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 4 | CAMoE | 54.4 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 5 | X-Pool | 53.2 | Yes | X-Pool: Cross-Modal Language-Video Attention for... | 2022-03-28 | Code |
| 6 | MDMMT-2 | 48 | Yes | MDMMT-2: Multidomain Multimodal Transformer for ... | 2022-03-14 | - |
| 7 | CenterCLIP (ViT-B/16) | 47.3 | Yes | CenterCLIP: Token Clustering for Efficient Text-... | 2022-05-02 | Code |
| 8 | DiffusionRet | 40.7 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 9 | EMCL-Net++ | 8 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 10 | HunYuan_tvr (huge) | 3.9 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
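
For reference, here is a minimal sketch of how a mean-rank score is commonly computed from a text-to-video similarity matrix: for each text query, find the position of its matching video in the sorted score list (1 means it was retrieved first), then average those positions over all queries. The function name `mean_rank`, the toy matrix, and the assumption that query `i` matches video `i` are illustrative only and are not taken from any of the papers listed above; individual papers may handle ties or multiple captions per video differently.

```python
def mean_rank(similarity):
    """Average rank of the matching video over all text queries.

    similarity[i][j] is the score between text query i and video j.
    Assumption for this sketch: the matching video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank = 1 + number of videos scored strictly higher than the match.
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)
    return sum(ranks) / len(ranks)


# Toy 3x3 similarity matrix (made-up numbers, not from any model above).
toy = [
    [0.9, 0.1, 0.2],  # matching video ranked 1st
    [0.8, 0.5, 0.6],  # matching video ranked 3rd
    [0.3, 0.7, 0.4],  # matching video ranked 2nd
]
print(mean_rank(toy))  # prints 2.0
```

A perfect retriever would place every matching video first and score 1.0, which is the smallest value the metric can take.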