Metric: text-to-video R@10, i.e. recall at rank 10 in percent (higher is better)
| # | Model | text-to-video R@10 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | GRAM | 100 | Yes | Gramian Multimodal Representation Learning and A... | 2024-12-16 | Code |
| 2 | VAST | 99.2 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 3 | VALOR | 98.7 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 4 | Unmasked Teacher | 97.8 | No | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 5 | Side4Video | 97 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 6 | Cap4Video | 97 | No | Cap4Video: What Can Auxiliary Captions Do for Te... | 2022-12-31 | Code |
| 7 | TeachCLIP | 96.1 | No | - | - | Code |
| 8 | TS2-Net | 95.2 | No | TS2-Net: Token Shift and Selection Transformer f... | 2022-07-16 | Code |
| 9 | QB-Norm+CLIP2Video | 93.8 | Yes | Cross Modal Retrieval with Querybank Normalisation | 2021-12-23 | Code |
| 10 | LAFF | 91.7 | No | Lightweight Attentional Feature Fusion: A New Ba... | 2021-12-03 | Code |
| 11 | CLIP2Video | 90 | Yes | CLIP2Video: Mastering Video-Text Retrieval via I... | 2021-06-21 | Code |
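
For readers unfamiliar with the metric, the following is a minimal sketch of how a text-to-video R@K score is typically computed from a query-by-video similarity matrix. It is not taken from any of the listed papers. The function name `recall_at_k`, the assumption that the correct video for query `i` sits at index `i`, and the optimistic handling of ties are all illustrative choices; individual benchmarks and codebases may differ.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 10) -> float:
    """Return text-to-video Recall@K in percent.

    Assumption (for illustration only): similarity[i][j] scores text query i
    against video j, and the ground-truth video for query i is at index i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: the number of videos scored
        # strictly higher. Ties are therefore resolved in the query's favour.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data: 3 text queries scored against 3 videos.
toy = [
    [0.9, 0.1, 0.2],  # correct video ranked first
    [0.8, 0.3, 0.5],  # correct video ranked third
    [0.2, 0.7, 0.6],  # correct video ranked second
]
```

Under these assumptions, a score of 97 in the table would mean that for 97% of the text queries the correct video appears among the ten highest-scoring candidates.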