Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li
Video-text retrieval plays an essential role in multi-modal research and is widely used in many real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. Several questions are investigated via empirical studies: 1) Are image features enough for video-text retrieval? 2) How does post-pretraining on a large-scale video-text dataset on top of CLIP affect performance? 3) What is a practical mechanism to model temporal dependency between video frames? 4) How sensitive is the model to hyper-parameters on the video-text retrieval task? Extensive experimental results demonstrate that the CLIP4Clip model transferred from CLIP achieves SOTA results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo. We release our code at https://github.com/ArrowLuo/CLIP4Clip.
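As an illustration of the simplest way to turn per-frame CLIP image features into a single video representation for retrieval (parameter-free mean pooling, one of the similarity calculators the paper compares against temporal models), here is a minimal numpy sketch. The random arrays stand in for real CLIP frame and text embeddings; shapes and the 512-d dimensionality are illustrative assumptions:

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between two batches of vectors.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
# Stand-ins for CLIP outputs: 3 videos x 12 sampled frames x 512-d,
# and one 512-d text embedding per caption.
frame_feats = rng.normal(size=(3, 12, 512))
text_feats = rng.normal(size=(3, 512))

# Parameter-free "mean pooling": average the frame embeddings
# into one vector per video, then score against each caption.
video_feats = frame_feats.mean(axis=1)
sim = cosine_sim(text_feats, video_feats)  # (3 texts) x (3 videos)
```

Ranking the columns of each row of `sim` then gives the text-to-video retrieval result for that caption.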
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 15.3 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 2 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 81.6 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 2 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 42.7 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 80.6 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 70.9 | CLIP4Clip |
| Video Retrieval | ActivityNet | text-to-video Mean Rank | 7.5 | CLIP4Clip |
| Video Retrieval | ActivityNet | text-to-video Median Rank | 2 | CLIP4Clip |
| Video Retrieval | ActivityNet | text-to-video R@1 | 40.5 | CLIP4Clip |
| Video Retrieval | ActivityNet | text-to-video R@5 | 73.4 | CLIP4Clip |
| Video Retrieval | ActivityNet | text-to-video R@50 | 98.2 | CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video Mean Rank | 17.5 | CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video Median Rank | 2 | CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video R@1 | 43.4 | CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video R@10 | 80.6 | CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video R@5 | 70.2 | CLIP4Clip |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 44.5 | CLIP4Clip-seqTransf |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 81.6 | CLIP4Clip-seqTransf |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 71.4 | CLIP4Clip-seqTransf |
| Video Retrieval | LSMDC | text-to-video Mean Rank | 58 | CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video R@1 | 21.6 | CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video R@10 | 49.8 | CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video R@5 | 41.8 | CLIP4Clip |
| Video Retrieval | MSVD | text-to-video Mean Rank | 10 | CLIP4Clip |
| Video Retrieval | MSVD | text-to-video Median Rank | 2 | CLIP4Clip |
| Video Retrieval | MSVD | text-to-video R@1 | 46.2 | CLIP4Clip |
| Video Retrieval | MSVD | text-to-video R@10 | 84.6 | CLIP4Clip |
| Video Retrieval | MSVD | text-to-video R@5 | 76.1 | CLIP4Clip |
| Video Retrieval | MSVD | video-to-text Median Rank | 1 | CLIP4Clip |
| Video Retrieval | MSVD | video-to-text R@1 | 62 | CLIP4Clip |
| Video Retrieval | MSVD | video-to-text R@10 | 92.6 | CLIP4Clip |
| Video Retrieval | MSVD | video-to-text R@5 | 87.3 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Mean Rank | 34 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 4 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 32 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 66.9 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 57 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSVD | text-to-video Mean Rank | 17.8 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 2 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 38.5 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 76.8 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 66.9 | CLIP4Clip |
| Zero-Shot Video Retrieval | LSMDC | text-to-video Mean Rank | 117 | CLIP4Clip |
| Zero-Shot Video Retrieval | LSMDC | text-to-video Median Rank | 28 | CLIP4Clip |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 15.1 | CLIP4Clip |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 36.4 | CLIP4Clip |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 28.5 | CLIP4Clip |
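The metrics reported above (R@K, median rank, mean rank) can all be derived from a query-to-candidate similarity matrix. The following is a hedged sketch of that computation on a toy matrix, not the authors' evaluation script; it assumes the ground-truth candidate for query `i` sits at column `i`:

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/5/10, median and mean rank from a similarity matrix.

    sim[i, j] is the similarity between query i and candidate j;
    the correct candidate for query i is assumed to be index i.
    """
    # Sort candidates by descending similarity, then find the
    # position of the ground-truth candidate (1 = ranked best).
    order = np.argsort(-sim, axis=1)
    gt = np.arange(sim.shape[0])
    ranks = np.argmax(order == gt[:, None], axis=1) + 1
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MedR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }

# Toy 4x4 example: queries 0 and 1 rank their match first,
# query 2 ranks it second, query 3 ranks it last.
sim = np.array([
    [0.9, 0.1, 0.1, 0.1],
    [0.2, 0.8, 0.1, 0.1],
    [0.1, 0.7, 0.6, 0.1],
    [0.5, 0.4, 0.3, 0.2],
])
metrics = retrieval_metrics(sim)
# ranks are [1, 1, 2, 4], so R@1 = 50.0, MedR = 1.5, MeanR = 2.0
```

Higher R@K is better, while lower median and mean rank are better, which is why the strongest rows above pair high R@10 with a median rank of 1 or 2.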