Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo

2022-09-14 · Video Retrieval · Video-Text Retrieval · Text Retrieval · Retrieval
Paper · PDF · Code (official)

Abstract

Pre-trained image-text models such as CLIP have demonstrated the power of vision-language representations learned from large-scale web-collected image-text data. Building on these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) remains underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that data scale and the domain gap between language sources have a great impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.
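To make the setup concrete: the simplest way to transfer a CLIP-style image encoder to video-text retrieval is to embed sampled frames, pool them into a single video embedding, and score it against a text embedding by cosine similarity. The sketch below shows that common mean-pooling baseline only; it is not the paper's Video Proxy mechanism, and all function names here are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as CLIP does before similarity scoring."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_pool(frame_embeddings):
    """Average per-frame image embeddings into one video embedding
    (the naive baseline; CLIP-ViP instead uses learned video proxy tokens)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(f[d] for f in frame_embeddings) / n for d in range(dim)]

def text_video_similarity(text_emb, frame_embeddings):
    """Cosine similarity between a text embedding and a pooled video embedding."""
    video = l2_normalize(mean_pool(frame_embeddings))
    text = l2_normalize(text_emb)
    return sum(t * v for t, v in zip(text, video))
```

At retrieval time, each text query is scored against every candidate video with this similarity, and videos are ranked by the resulting scores.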

Results

Task            | Dataset     | Metric                    | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video R@1         | 57.7  | CLIP-ViP
Video Retrieval | MSR-VTT-1kA | text-to-video R@5         | 80.5  | CLIP-ViP
Video Retrieval | MSR-VTT-1kA | text-to-video R@10        | 88.2  | CLIP-ViP
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 1     | CLIP-ViP
Video Retrieval | ActivityNet | text-to-video R@1         | 61.4  | CLIP-ViP
Video Retrieval | ActivityNet | text-to-video R@5         | 85.7  | CLIP-ViP
Video Retrieval | ActivityNet | text-to-video R@10        | 92.6  | CLIP-ViP
Video Retrieval | ActivityNet | text-to-video Median Rank | 1     | CLIP-ViP
Video Retrieval | DiDeMo      | text-to-video R@1         | 55.3  | CLIP-ViP
Video Retrieval | DiDeMo      | text-to-video R@5         | 82.0  | CLIP-ViP
Video Retrieval | DiDeMo      | text-to-video R@10        | 89.3  | CLIP-ViP
Video Retrieval | DiDeMo      | text-to-video Median Rank | 1     | CLIP-ViP
Video Retrieval | LSMDC       | text-to-video R@1         | 30.7  | CLIP-ViP
Video Retrieval | LSMDC       | text-to-video R@5         | 51.4  | CLIP-ViP
Video Retrieval | LSMDC       | text-to-video R@10        | 60.6  | CLIP-ViP
Video Retrieval | LSMDC       | text-to-video Median Rank | 5     | CLIP-ViP
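The R@k and Median Rank figures above follow from a query-by-video similarity matrix: R@k is the percentage of text queries whose matching video appears in the top k results, and Median Rank is the median position of the matching video. A minimal sketch of that standard computation, assuming the usual 1:1 text-video pairing (this is the conventional formulation, not the paper's exact evaluation script):

```python
def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Text-to-video retrieval metrics from a similarity matrix.

    sim[i][j] is the similarity of text query i to video j; the ground-truth
    match for query i is assumed to be video i (standard 1:1 evaluation).
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct video = 1 + number of videos scored strictly higher.
        rank = 1 + sum(1 for j, s in enumerate(row) if j != i and s > row[i])
        ranks.append(rank)
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    sorted_ranks = sorted(ranks)
    mid = n // 2
    metrics["MdR"] = (sorted_ranks[mid] if n % 2
                      else (sorted_ranks[mid - 1] + sorted_ranks[mid]) / 2)
    return metrics
```

For example, a 2x2 similarity matrix where each query scores its own video highest yields R@1 = 100.0 and MdR = 1.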

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)