Metric: video-to-text R@10 (higher is better)
| # | Model | video-to-text R@10 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | CAMoE | 92.8 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 2 | GRAM | 91.5 | Yes | Gramian Multimodal Representation Learning and A... | 2024-12-16 | Code |
| 3 | VideoCoCa (zero-shot) | 91.4 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 4 | CLIP2Video | 90.8 | Yes | CLIP2Video: Mastering Video-Text Retrieval via I... | 2021-06-21 | Code |
| 5 | vid-TLDR (UMT-L) | 86.9 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 6 | UMT-L (ViT-L/16) | 86.5 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 7 | CoCa (zero-shot) | 81.4 | Yes | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 8 | CLIP | 79.2 | No | A Straightforward Framework For Video Retrieval ... | 2021-02-24 | Code |
| 9 | Collaborative Experts | 55.2 | No | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |
| 10 | JEMC | 42.2 | No | - | - | Code |
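
The leaderboard does not say how R@10 is computed, so the following is only a minimal sketch under the usual definition of Recall@K for video-to-text retrieval: each video is a query, all candidate captions are ranked by similarity, and a query counts as a hit if at least one of its ground-truth captions appears in the top K. The function name `recall_at_k`, the similarity matrix `sim`, and the `relevant` sets are hypothetical names chosen for illustration, and the toy scores are made up. Individual papers in the table may use different evaluation protocols.

```python
from typing import Sequence, Set


def recall_at_k(
    sim: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    k: int = 10,
) -> float:
    """Percentage of video queries with at least one relevant caption in the top k.

    sim[i][j] is the similarity between video i and caption j (assumed layout).
    relevant[i] is the set of caption indices that are correct for video i.
    """
    hits = 0
    for scores, gold in zip(sim, relevant):
        # Rank caption indices by descending similarity to this video.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # The query is a hit if any ground-truth caption is among the top k.
        if gold & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


# Toy data: 3 videos (rows) scored against 4 captions (columns).
sim = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.5, 0.6, 0.1, 0.7],
]
relevant = [{0}, {1}, {2}]

for k in (1, 2, 4):
    print(f"R@{k} = {recall_at_k(sim, relevant, k):.1f}")
```

With K equal to the number of candidate captions every query is trivially a hit, which is why reported values such as R@10 are only meaningful relative to the size of the caption pool.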