A. Nagrani et al.

Reported on 3 benchmarks across 1 task · 1 paper

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Computer Vision · 3 results

Zero-Shot Video Retrieval on MSR-VTT
text-to-video R@1 · uses extra data · 2022-04-01
19.4
best: 55.9 (InternVideo2-6B)
Learning Audio-Video Modalities from Image Captions arXiv:2204.00679
Zero-Shot Video Retrieval on MSR-VTT
text-to-video R@10 · uses extra data · 2022-04-01
50.3
best: 85.1 (InternVideo2-6B)
Learning Audio-Video Modalities from Image Captions arXiv:2204.00679
Zero-Shot Video Retrieval on MSR-VTT
text-to-video R@5 · uses extra data · 2022-04-01
39.5
best: 78.3 (InternVideo2-6B)
Learning Audio-Video Modalities from Image Captions arXiv:2204.00679