Metric: video-to-text R@5 (higher is better)
| # | Model | video-to-text R@5 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | HunYuan_tvr (huge) | 71.8 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 2 | vid-TLDR (UMT-L) | 70.2 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 3 | UMT-L (ViT-L/16) | 64.3 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 4 | HunYuan_tvr | 47.5 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 5 | CenterCLIP (ViT-B/16) | 46.4 | Yes | CenterCLIP: Token Clustering for Efficient Text-... | 2022-05-02 | Code |
| 6 | EMCL-Net++ | 44.7 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 7 | DiffusionRet | 43.5 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 8 | X-Pool | 42.6 | Yes | X-Pool: Cross-Modal Language-Video Attention for... | 2022-03-28 | Code |
| 9 | EMCL-Net | 40.6 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 10 | Ours | 34.1 | No | Video and Text Matching with Conditioned Embeddi... | 2021-10-21 | Code |
| 11 | CLIP | 16.4 | No | A Straightforward Framework For Video Retrieval ... | 2021-02-24 | Code |