Metric: text-to-video R@5 (higher is better)
| # | Model | text-to-video R@5 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | InternVideo2-6B | 85.6 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 2 | InternVideo2-1B | 83.9 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 3 | UMT-L (ViT-L/16) | 69.6 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 4 | vid-TLDR (UMT-L) | 69.4 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 5 | LanguageBind(ViT-H/14) | 68.4 | Yes | LanguageBind: Extending Video-Language Pretraini... | 2023-10-03 | Code |
| 6 | BT-Adapter | 66.7 | Yes | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 7 | LanguageBind(ViT-L/14) | 66.6 | Yes | LanguageBind: Extending Video-Language Pretraini... | 2023-10-03 | Code |
| 8 | VideoCoCa | 63.2 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 9 | Singularity-temporal-5M | 55.9 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 10 | Singularity-temporal-17M | 55.6 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
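
For readers unfamiliar with the metric, the following is a minimal sketch of how text-to-video R@5 (recall at rank 5) is commonly computed. The leaderboard does not state its evaluation protocol, so the one-to-one caption/video pairing, the percentage scale, and the tie handling here are assumptions, and `recall_at_k` is a hypothetical helper name, not part of any listed model's code.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 5) -> float:
    """Percentage of text queries whose matching video ranks in the top k.

    Assumption: similarity[i][j] scores text query i against video j,
    and video i is the single ground-truth match for text i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # 0-based rank of the correct video: how many videos score strictly
        # higher. Ties are resolved in favour of the correct video (assumption).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    # Reported on a 0-100 scale, which the table above appears to use.
    return 100.0 * hits / len(similarity)


# Toy example with 3 captions and 3 videos (made-up scores).
sim = [
    [0.9, 0.1, 0.2],  # correct video ranked first
    [0.8, 0.6, 0.5],  # correct video ranked second
    [0.1, 0.2, 0.7],  # correct video ranked first
]
r_at_1 = recall_at_k(sim, k=1)  # two of three queries hit
r_at_2 = recall_at_k(sim, k=2)  # all three queries hit
```

Under this definition a higher value means more captions retrieve their paired video within the top five results, which is why the table is sorted in descending order.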