Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li

2021-04-18 · Tasks: Video Retrieval · Video-Text Retrieval · Zero-Shot Video Retrieval · Text Retrieval · Text-to-Video Retrieval · Video Understanding · Retrieval

Paper · PDF · Code (official)

Abstract

Video-text retrieval plays an essential role in multi-modal research and is widely used in many real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose the CLIP4Clip model to transfer the knowledge of CLIP to video-language retrieval in an end-to-end manner. Several questions are investigated via empirical studies: 1) Are image features enough for video-text retrieval? 2) How does post-pretraining on a large-scale video-text dataset affect the performance of CLIP? 3) What is a practical mechanism for modeling temporal dependency between video frames? 4) How sensitive is the model to hyper-parameters on the video-text retrieval task? Extensive experimental results show that the CLIP4Clip model, transferred from CLIP, achieves state-of-the-art results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo. We release our code at https://github.com/ArrowLuo/CLIP4Clip.
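The simplest CLIP4Clip variant aggregates per-frame CLIP image embeddings with parameter-free mean pooling and scores videos against texts by cosine similarity. A minimal NumPy sketch of that similarity computation, using random stand-in vectors in place of real CLIP encoder outputs (the function names and dimensions here are illustrative, not taken from the released code):

```python
import numpy as np

def mean_pool_video(frame_embs: np.ndarray) -> np.ndarray:
    """Parameter-free mean pooling over the frame axis: (T, D) -> (D,)."""
    return frame_embs.mean(axis=0)

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Normalize embeddings so the dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def similarity_matrix(text_embs: np.ndarray, video_frame_embs) -> np.ndarray:
    """Cosine similarity between N texts (N, D) and N videos,
    each given as a (T_i, D) array of per-frame embeddings."""
    videos = np.stack([mean_pool_video(v) for v in video_frame_embs])
    t = l2_normalize(text_embs)
    v = l2_normalize(videos)
    return t @ v.T  # (N_text, N_video)
```

In the paper this "meanP" aggregation is one of several temporal-modeling choices; the CLIP4Clip-seqTransf variant in the results below instead adds a small Transformer over the frame sequence before pooling.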

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 81.6 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 2 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 15.3 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 42.7 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 70.9 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 80.6 | CLIP4Clip |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 2 | CLIP4Clip |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 44.5 | CLIP4Clip-seqTransf |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 71.4 | CLIP4Clip-seqTransf |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 81.6 | CLIP4Clip-seqTransf |
| Video Retrieval | MSVD | text-to-video R@1 | 46.2 | CLIP4Clip |
| Video Retrieval | MSVD | text-to-video R@5 | 76.1 | CLIP4Clip |
| Video Retrieval | MSVD | text-to-video R@10 | 84.6 | CLIP4Clip |
| Video Retrieval | MSVD | text-to-video Median Rank | 2 | CLIP4Clip |
| Video Retrieval | MSVD | text-to-video Mean Rank | 10 | CLIP4Clip |
| Video Retrieval | MSVD | video-to-text R@1 | 62 | CLIP4Clip |
| Video Retrieval | MSVD | video-to-text R@5 | 87.3 | CLIP4Clip |
| Video Retrieval | MSVD | video-to-text R@10 | 92.6 | CLIP4Clip |
| Video Retrieval | MSVD | video-to-text Median Rank | 1 | CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video R@1 | 21.6 | CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video R@5 | 41.8 | CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video R@10 | 49.8 | CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video Mean Rank | 58 | CLIP4Clip |
| Video Retrieval | ActivityNet | text-to-video R@1 | 40.5 | CLIP4Clip |
| Video Retrieval | ActivityNet | text-to-video R@5 | 73.4 | CLIP4Clip |
| Video Retrieval | ActivityNet | text-to-video R@50 | 98.2 | CLIP4Clip |
| Video Retrieval | ActivityNet | text-to-video Median Rank | 2 | CLIP4Clip |
| Video Retrieval | ActivityNet | text-to-video Mean Rank | 7.5 | CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video R@1 | 43.4 | CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video R@5 | 70.2 | CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video R@10 | 80.6 | CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video Median Rank | 2 | CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video Mean Rank | 17.5 | CLIP4Clip |
| Text-to-Video Retrieval | MSR-VTT | text-to-video R@1 | 44.5 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 32 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 57 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 66.9 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 4 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Mean Rank | 34 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 38.5 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 66.9 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 76.8 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 2 | CLIP4Clip |
| Zero-Shot Video Retrieval | MSVD | text-to-video Mean Rank | 17.8 | CLIP4Clip |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 15.1 | CLIP4Clip |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 28.5 | CLIP4Clip |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 36.4 | CLIP4Clip |
| Zero-Shot Video Retrieval | LSMDC | text-to-video Median Rank | 28 | CLIP4Clip |
| Zero-Shot Video Retrieval | LSMDC | text-to-video Mean Rank | 117 | CLIP4Clip |
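The R@K, Median Rank, and Mean Rank figures above can all be derived from a single text-video similarity matrix. A small NumPy sketch of that evaluation, assuming the ground-truth match for query i is candidate i (the standard setup on these benchmarks); the function name is illustrative, not from the released code:

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """sim[i, j] is the similarity of query i to candidate j; ground truth
    lies on the diagonal. Returns R@1/R@5/R@10 (percent), median rank,
    and mean rank, with ranks 1-indexed (lower rank = better)."""
    order = np.argsort(-sim, axis=1)  # candidates sorted best-first per query
    # Position of the correct candidate in each sorted row, plus 1.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    return {
        "R@1": 100.0 * np.mean(ranks <= 1),
        "R@5": 100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "MedR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }
```

Note how Mean Rank is dragged up by a few badly-ranked queries (e.g. 117 on zero-shot LSMDC) while Median Rank is robust to them, which is why the table reports both.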

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)