Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo

2022-09-14 · Video Retrieval · Video-Text Retrieval · Text Retrieval · Retrieval
Paper · PDF · Code (official)

Abstract

Pre-trained image-text models such as CLIP have demonstrated the power of vision-language representations learned from large-scale web-collected image-text data. Building on these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) remains underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that data scale and the domain gap between language sources have a great impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.
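To make the setup concrete: the simplest way to transfer a CLIP-style image encoder to video-text retrieval is to embed sampled frames, pool them into a single video embedding, and score it against a text embedding by cosine similarity. The sketch below shows that common mean-pooling baseline only; it is not the paper's Video Proxy mechanism, and all function names here are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as CLIP does before similarity scoring."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_pool(frame_embeddings):
    """Average per-frame image embeddings into one video embedding
    (the naive baseline; CLIP-ViP instead uses learned video proxy tokens)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(f[d] for f in frame_embeddings) / n for d in range(dim)]

def text_video_similarity(text_emb, frame_embeddings):
    """Cosine similarity between a text embedding and a pooled video embedding."""
    video = l2_normalize(mean_pool(frame_embeddings))
    text = l2_normalize(text_emb)
    return sum(t * v for t, v in zip(text, video))
```

At retrieval time, each text query is scored against every candidate video with this similarity, and videos are ranked by the resulting scores.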

Results

Task            | Dataset     | Metric                    | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video R@1         | 57.7  | CLIP-ViP
Video Retrieval | MSR-VTT-1kA | text-to-video R@5         | 80.5  | CLIP-ViP
Video Retrieval | MSR-VTT-1kA | text-to-video R@10        | 88.2  | CLIP-ViP
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 1     | CLIP-ViP
Video Retrieval | ActivityNet | text-to-video R@1         | 61.4  | CLIP-ViP
Video Retrieval | ActivityNet | text-to-video R@5         | 85.7  | CLIP-ViP
Video Retrieval | ActivityNet | text-to-video R@10        | 92.6  | CLIP-ViP
Video Retrieval | ActivityNet | text-to-video Median Rank | 1     | CLIP-ViP
Video Retrieval | DiDeMo      | text-to-video R@1         | 55.3  | CLIP-ViP
Video Retrieval | DiDeMo      | text-to-video R@5         | 82.0  | CLIP-ViP
Video Retrieval | DiDeMo      | text-to-video R@10        | 89.3  | CLIP-ViP
Video Retrieval | DiDeMo      | text-to-video Median Rank | 1     | CLIP-ViP
Video Retrieval | LSMDC       | text-to-video R@1         | 30.7  | CLIP-ViP
Video Retrieval | LSMDC       | text-to-video R@5         | 51.4  | CLIP-ViP
Video Retrieval | LSMDC       | text-to-video R@10        | 60.6  | CLIP-ViP
Video Retrieval | LSMDC       | text-to-video Median Rank | 5     | CLIP-ViP
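The R@k and Median Rank figures above follow from a query-by-video similarity matrix: R@k is the percentage of text queries whose matching video appears in the top k results, and Median Rank is the median position of the matching video. A minimal sketch of that standard computation, assuming the usual 1:1 text-video pairing (this is the conventional formulation, not the paper's exact evaluation script):

```python
def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Text-to-video retrieval metrics from a similarity matrix.

    sim[i][j] is the similarity of text query i to video j; the ground-truth
    match for query i is assumed to be video i (standard 1:1 evaluation).
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct video = 1 + number of videos scored strictly higher.
        rank = 1 + sum(1 for j, s in enumerate(row) if j != i and s > row[i])
        ranks.append(rank)
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    sorted_ranks = sorted(ranks)
    mid = n // 2
    metrics["MdR"] = (sorted_ranks[mid] if n % 2
                      else (sorted_ranks[mid - 1] + sorted_ranks[mid]) / 2)
    return metrics
```

For example, a 2x2 similarity matrix where each query scores its own video highest yields R@1 = 100.0 and MdR = 1.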

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)