Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen

2021-06-21 · Video Retrieval · Video-Text Retrieval · Text Retrieval · Retrieval · Language Modelling · Video to Text Retrieval
Paper · PDF · Code (official)

Abstract

We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in video-and-language learning try to distill spatio-temporal video features and multi-modal interactions between videos and language from large-scale video-text datasets. In contrast, we leverage a pretrained image-language model and simplify it into a two-stage framework: co-learning of image and text, followed by enhancement of the temporal relations between video frames and between video and text, which makes it possible to train on comparatively small datasets. Specifically, on top of the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model adds a Temporal Difference Block to capture motion across video frames at fine temporal granularity, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.
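The temporal-difference idea in the abstract can be sketched in code. Below is a minimal, illustrative PyTorch sketch of a temporal-difference-style block: difference tokens between adjacent CLIP frame embeddings are interleaved with the frame tokens, passed through a small transformer, and pooled into a single video embedding. The module name, encoder sizes, and the random stand-in for CLIP frame features are all assumptions for illustration, not the authors' implementation (see the official code for that).

```python
# Illustrative sketch of a temporal-difference block: interleave
# frame-difference tokens between adjacent frame embeddings, encode the
# sequence with a small transformer, and pool into a video embedding.
# CLIP features are faked with a random tensor; sizes are assumptions.
import torch
import torch.nn as nn

class TemporalDifferenceBlock(nn.Module):
    def __init__(self, dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_emb: torch.Tensor) -> torch.Tensor:
        # frame_emb: (batch, T, dim) per-frame image embeddings
        diff = frame_emb[:, 1:] - frame_emb[:, :-1]   # (batch, T-1, dim) motion cues
        # Interleave frame and difference tokens: f1, d1, f2, d2, ..., fT
        tokens = torch.empty(
            frame_emb.size(0), 2 * frame_emb.size(1) - 1, frame_emb.size(2),
            device=frame_emb.device, dtype=frame_emb.dtype,
        )
        tokens[:, 0::2] = frame_emb
        tokens[:, 1::2] = diff
        encoded = self.temporal_encoder(tokens)
        # Mean-pool the frame positions into one video embedding
        return encoded[:, 0::2].mean(dim=1)           # (batch, dim)

frames = torch.randn(2, 12, 512)   # stand-in for CLIP features of 12 frames
video_emb = TemporalDifferenceBlock()(frames)
print(video_emb.shape)             # torch.Size([2, 512])
```

The resulting video embedding can then be matched against CLIP text embeddings by cosine similarity, which is what the retrieval results below measure.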

Results

| Task            | Dataset     | Metric                     | Value | Model      |
|-----------------|-------------|----------------------------|-------|------------|
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1          | 45.6  | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5          | 72.6  | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10         | 81.7  | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank  | 2     | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank    | 14.6  | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1          | 43.3  | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5          | 72.3  | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10         | 82.1  | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank  | 2     | CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | video-to-text Mean Rank    | 10.2  | CLIP2Video |
| Video Retrieval | VATEX       | text-to-video R@1          | 57.3  | CLIP2Video |
| Video Retrieval | VATEX       | text-to-video R@10         | 90    | CLIP2Video |
| Video Retrieval | VATEX       | text-to-video R@50         | 95.5  | CLIP2Video |
| Video Retrieval | MSR-VTT     | text-to-video R@1          | 29.8  | CLIP2Video |
| Video Retrieval | MSR-VTT     | text-to-video R@5          | 55.5  | CLIP2Video |
| Video Retrieval | MSR-VTT     | text-to-video R@10         | 66.2  | CLIP2Video |
| Video Retrieval | MSR-VTT     | text-to-video Median Rank  | 4     | CLIP2Video |
| Video Retrieval | MSR-VTT     | text-to-video Mean Rank    | 45.4  | CLIP2Video |
| Video Retrieval | MSR-VTT     | video-to-text R@1          | 54.6  | CLIP2Video |
| Video Retrieval | MSR-VTT     | video-to-text R@5          | 82.1  | CLIP2Video |
| Video Retrieval | MSR-VTT     | video-to-text R@10         | 90.8  | CLIP2Video |
| Video Retrieval | MSR-VTT     | video-to-text Median Rank  | 1     | CLIP2Video |
| Video Retrieval | MSR-VTT     | video-to-text Mean Rank    | 5.3   | CLIP2Video |
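For reference, the R@K, Median Rank, and Mean Rank figures above are standard retrieval metrics computed from a query-by-candidate similarity matrix. A minimal NumPy sketch, assuming query i's ground-truth match is candidate i (names are illustrative, not from the official evaluation code):

```python
# Illustrative computation of retrieval metrics (R@K, Median Rank,
# Mean Rank) from a similarity matrix; assumes the correct candidate
# for query i sits at index i.
import numpy as np

def retrieval_metrics(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    # sim: (num_queries, num_candidates), higher = more similar
    order = np.argsort(-sim, axis=1)  # candidates sorted best-first
    # 1-indexed rank of the ground-truth candidate for each query
    ranks = 1 + np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    metrics["Mean Rank"] = float(np.mean(ranks))
    return metrics

sim = np.random.randn(1000, 1000)  # e.g. an MSR-VTT-1kA-sized split
print(retrieval_metrics(sim))
```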

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)