Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches to video-and-language learning try to distill spatio-temporal video features and the multi-modal interaction between videos and language from large-scale video-text datasets. In contrast, we leverage a pretrained image-language model and simplify the problem into a two-stage framework: co-learning of image-text representations, followed by enhancement of the temporal relations between video frames and between video and text. This makes the model trainable on comparatively small datasets. Specifically, building on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model adds a Temporal Difference Block to capture motion across fine-grained temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.
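To make the idea of a temporal difference concrete, the following is a minimal sketch, not the paper's implementation. It assumes each frame has already been encoded into a fixed-size embedding (for example by an image encoder such as CLIP's), takes the difference between adjacent frame embeddings as a crude motion cue, interleaves those differences with the frame embeddings, and mean-pools the result into one video vector. The function names (`frame_differences`, `interleave`, `mean_pool`, `cosine`) and the pooling choice are hypothetical; the abstract does not specify how the Temporal Difference Block processes these signals.

```python
import math


def frame_differences(frames):
    """Element-wise difference between each pair of adjacent frame embeddings."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(frames, frames[1:])]


def interleave(frames, diffs):
    """Place each difference vector between the two frames it was computed from."""
    tokens = []
    for i, frame in enumerate(frames):
        tokens.append(frame)
        if i < len(diffs):
            tokens.append(diffs[i])
    return tokens


def mean_pool(tokens):
    """Average a list of equal-length vectors into a single vector (an assumed pooling)."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-dimensional "frame embeddings" for a 3-frame clip (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
diffs = frame_differences(frames)   # one difference per adjacent frame pair
tokens = interleave(frames, diffs)  # frame, diff, frame, diff, frame
video_vector = mean_pool(tokens)

# Toy text embedding in the same space (made-up values).
text_vector = [0.0, 1.0]
score = cosine(video_vector, text_vector)
```

In a trained model the difference tokens would presumably pass through learned layers before being combined with the frame tokens; the fixed mean pooling here only stands in for that step.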
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 14.6 | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 2 | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 45.6 | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 81.7 | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 72.6 | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | video-to-text Mean Rank | 10.2 | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 2 | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 43.3 | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 82.1 | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 72.3 | CLIP2Video |
| Video Retrieval | VATEX | text-to-video R@1 | 57.3 | CLIP2Video |
| Video Retrieval | VATEX | text-to-video R@10 | 90 | CLIP2Video |
| Video Retrieval | VATEX | text-to-video R@50 | 95.5 | CLIP2Video |
| Video Retrieval | MSR-VTT | text-to-video Mean Rank | 45.4 | CLIP2Video |
| Video Retrieval | MSR-VTT | text-to-video Median Rank | 4 | CLIP2Video |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 29.8 | CLIP2Video |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 66.2 | CLIP2Video |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 55.5 | CLIP2Video |
| Video Retrieval | MSR-VTT | video-to-text Mean Rank | 5.3 | CLIP2Video |
| Video Retrieval | MSR-VTT | video-to-text Median Rank | 1 | CLIP2Video |
| Video Retrieval | MSR-VTT | video-to-text R@1 | 54.6 | CLIP2Video |
| Video Retrieval | MSR-VTT | video-to-text R@10 | 90.8 | CLIP2Video |
| Video Retrieval | MSR-VTT | video-to-text R@5 | 82.1 | CLIP2Video |
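The metrics in the table (R@K, Median Rank, Mean Rank) can all be derived from the rank of the ground-truth item for each query. The sketch below shows one common way to compute them from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located on the diagonal, which is a simplification: benchmarks with several captions per video need an extra aggregation step. The function names and the toy matrix are illustrative and are not taken from the CLIP2Video evaluation code.

```python
from statistics import mean, median


def retrieval_ranks(sim):
    """Return the 1-based rank of the ground-truth candidate for each query.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank is 1 plus the number of candidates scored strictly higher.
        ranks.append(1 + sum(1 for s in row if s > target))
    return ranks


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (as a percentage), Median Rank, and Mean Rank."""
    ranks = retrieval_ranks(sim)
    metrics = {
        f"R@{k}": 100.0 * sum(1 for r in ranks if r <= k) / len(ranks) for k in ks
    }
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy 3x3 similarity matrix (made-up values, not results from the paper).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
metrics = retrieval_metrics(sim)
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank. For text-to-video retrieval the rows are text queries and the columns are videos; transposing the matrix gives the video-to-text direction under the same one-to-one assumption.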