Metric: video-to-text R@1 (higher is better)
| # | Model | video-to-text R@1 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | InternVideo2-6B | 53.7 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 2 | GRAM | 52.9 | Yes | Gramian Multimodal Representation Learning and A... | 2024-12-16 | Code |
| 3 | InternVideo2-1B | 50.9 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 4 | FluxViT-B | 49.4 | Yes | Make Your Training Flexible: Towards Deployment-... | 2025-03-18 | Code |
| 5 | FluxViT-S | 44.9 | Yes | Make Your Training Flexible: Towards Deployment-... | 2025-03-18 | Code |
| 6 | LanguageBind(ViT-H/14) | 40.9 | Yes | LanguageBind: Extending Video-Language Pretraini... | 2023-10-03 | Code |
| 7 | InternVideo | 39.6 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 8 | UMT-L (ViT-L/16) | 38.6 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 9 | LanguageBind(ViT-L/14) | 38.3 | Yes | LanguageBind: Extending Video-Language Pretraini... | 2023-10-03 | Code |
| 10 | vid-TLDR (UMT-L) | 37.7 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 11 | LaT | 17.2 | No | LaT: Latent Translation with Cycle-Consistency f... | 2022-07-11 | - |
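
For readers unfamiliar with the metric, the sketch below illustrates how video-to-text R@1 (recall at rank 1) is commonly computed: for each query video, the captions are ranked by similarity, and the query counts as a hit if its ground-truth caption is ranked first. The function name `recall_at_k`, the toy similarity matrix, and the one-caption-per-video pairing are assumptions made for illustration only; the evaluation protocol behind the numbers in the table is not given here and may differ (for example, several captions per video).

```python
def recall_at_k(similarity, k=1):
    """Return Recall@k as a percentage.

    similarity[i][j] is the score between video i and caption j.
    Assumption: caption i is the single ground-truth match for video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring captions for video i.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (hypothetical values, not from any model above).
sim = [
    [0.9, 0.1, 0.3],  # video 0: caption 0 ranked first -> hit
    [0.2, 0.4, 0.8],  # video 1: caption 2 ranked first -> miss
    [0.1, 0.2, 0.7],  # video 2: caption 2 ranked first -> hit
]

print(round(recall_at_k(sim, k=1), 1))  # 66.7
```

Under this reading, a score of 53.7 would mean that the correct caption is ranked first for 53.7% of the query videos.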