Metric: text-to-video R@10 (higher is better)
| # | Model | text-to-video R@10 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | GRAM | 99.5 | Yes | Gramian Multimodal Representation Learning and A... | 2024-12-16 | Code |
| 2 | InternVideo2-6B | 97.1 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 3 | InternVideo2-1B | 96.9 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 4 | VideoCoCa | 90.1 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
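
For readers unfamiliar with the metric, the following is a minimal sketch of how a text-to-video R@K score is commonly computed from a text-by-video similarity matrix. It is not taken from any of the listed papers. The function name `recall_at_k`, the assumption that text `i` has exactly one correct video at index `i`, and the optimistic handling of ties are all illustrative choices, and the toy scores are made up.

```python
def recall_at_k(similarity, k=10):
    """Percentage of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score between text i and video j.
    Assumption: the correct video for text i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the ground-truth video: the number of videos
        # that score strictly higher (ties count in the query's favor).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 example with made-up scores: texts 0 and 1 rank their own
# video first, while text 2 ranks its own video last.
toy = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.6, 0.2],
]
r1 = recall_at_k(toy, k=1)
r3 = recall_at_k(toy, k=3)
```

Here `r1` counts two of three queries as hits, and `r3` counts all three, since every correct video falls within the top three. Returning a percentage matches the scale of the scores in the table.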