Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín
Video retrieval is a challenging task in which a text query is matched to a video, or vice versa. Most existing approaches to this problem rely on annotations made by users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model CLIP to obtain video representations without the need for such annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
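
The abstract does not say how frame-level CLIP features are combined into a video representation, so the following is only a minimal sketch under stated assumptions: each sampled frame has already been encoded by CLIP's image encoder, the frame vectors are averaged into one video vector (mean pooling is our assumption here, not necessarily the technique used in this work), and videos are ranked by cosine similarity to a CLIP text embedding. The names `video_embedding` and `rank_videos` are hypothetical helpers, and the toy vectors stand in for real CLIP outputs.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def video_embedding(frame_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Mean-pool unit-length frame embeddings into one unit-length video vector.

    Assumption: mean pooling is one simple aggregation choice among several.
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    frames = [l2_normalize(f) for f in frame_embeddings]
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return l2_normalize(mean)


def rank_videos(text_embedding: Sequence[float],
                video_embeddings: Sequence[Sequence[float]]) -> List[int]:
    """Return video indices sorted from most to least similar to the text query."""
    query = l2_normalize(text_embedding)
    scores = [sum(q * v for q, v in zip(query, l2_normalize(video)))
              for video in video_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-D stand-ins for CLIP features: two videos with two frames each.
videos = [
    video_embedding([[1.0, 0.0], [0.9, 0.1]]),
    video_embedding([[0.0, 1.0], [0.1, 0.9]]),
]
ranking = rank_videos([1.0, 0.2], videos)
```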
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Retrieval | ConQA Conceptual | R-precision | 6.8 | CLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 4 | CLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 31.2 | CLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 64.2 | CLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 53.7 | CLIP |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 5 | CLIP |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 27.2 | CLIP |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 62.6 | CLIP |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 51.7 | CLIP |
| Video Retrieval | MSR-VTT | text-to-video Median Rank | 10 | CLIP |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 21.4 | CLIP |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 50.4 | CLIP |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 41.1 | CLIP |
| Video Retrieval | MSR-VTT | video-to-text Median Rank | 2 | CLIP |
| Video Retrieval | MSR-VTT | video-to-text R@1 | 40.3 | CLIP |
| Video Retrieval | MSR-VTT | video-to-text R@10 | 79.2 | CLIP |
| Video Retrieval | MSR-VTT | video-to-text R@5 | 69.7 | CLIP |
| Video Retrieval | LSMDC | text-to-video Median Rank | 56.5 | CLIP |
| Video Retrieval | LSMDC | text-to-video R@1 | 11.3 | CLIP |
| Video Retrieval | LSMDC | text-to-video R@10 | 29.2 | CLIP |
| Video Retrieval | LSMDC | text-to-video R@5 | 22.7 | CLIP |
| Video Retrieval | LSMDC | video-to-text Median Rank | 73 | CLIP |
| Video Retrieval | LSMDC | video-to-text R@1 | 6.8 | CLIP |
| Video Retrieval | LSMDC | video-to-text R@10 | 22.1 | CLIP |
| Video Retrieval | LSMDC | video-to-text R@5 | 16.4 | CLIP |
| Video Retrieval | MSVD | text-to-video Median Rank | 3 | CLIP |
| Video Retrieval | MSVD | text-to-video R@1 | 37 | CLIP |
| Video Retrieval | MSVD | text-to-video R@10 | 73.8 | CLIP |
| Video Retrieval | MSVD | text-to-video R@5 | 64.1 | CLIP |
| Video Retrieval | MSVD | video-to-text Median Rank | 1 | CLIP |
| Video Retrieval | MSVD | video-to-text R@1 | 59.9 | CLIP |
| Video Retrieval | MSVD | video-to-text R@10 | 90.7 | CLIP |
| Video Retrieval | MSVD | video-to-text R@5 | 85.2 | CLIP |
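
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute recall at K (R@K, as a percentage) and Median Rank from a query-by-item similarity matrix. It is an illustration under simplifying assumptions, not the evaluation code behind the reported numbers: query `i` is assumed to have exactly one correct item at index `i`, and ties are resolved in favor of the correct item. The function names are hypothetical.

```python
from statistics import median
from typing import Dict, List, Sequence


def ground_truth_ranks(similarity: Sequence[Sequence[float]]) -> List[int]:
    """1-based rank of the correct item for each query (query i matches item i)."""
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank is 1 plus the number of items scored strictly higher.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def retrieval_metrics(similarity: Sequence[Sequence[float]],
                      ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Compute R@K (percentage of queries ranked within the top K) and Median Rank."""
    ranks = ground_truth_ranks(similarity)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks)
               for k in ks}
    metrics["MedR"] = median(ranks)
    return metrics


# Toy 4x4 similarity matrix: rows are queries, columns are candidate items.
toy_similarity = [
    [0.9, 0.1, 0.2, 0.0],
    [0.8, 0.3, 0.1, 0.0],
    [0.2, 0.4, 0.5, 0.1],
    [0.6, 0.5, 0.4, 0.3],
]
toy_metrics = retrieval_metrics(toy_similarity, ks=(1, 2))
```

Higher R@K is better, while lower Median Rank is better. With an even number of queries the median is the mean of the two middle ranks, which is how a fractional value such as 56.5 can arise.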