Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


A Straightforward Framework For Video Retrieval Using CLIP

Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín

Published: 2021-02-24 · Tasks: Video Retrieval, Retrieval
Links: Paper · PDF · Code (official)

Abstract

Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
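The core idea described above — encoding each sampled frame with CLIP's image tower, aggregating the frame embeddings into a single video representation, and comparing it to a CLIP text embedding by cosine similarity — can be sketched as follows. This is a minimal illustration, not the authors' code: the `encode_frame` and `encode_text` functions below are placeholder stand-ins for the pretrained CLIP encoders, and mean-pooling is one of several aggregation strategies the paper explores.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # CLIP ViT-B/32 embedding size

def encode_frame(frame):
    # Placeholder for CLIP's image encoder: maps an (H, W, 3) frame
    # to a unit-norm DIM-dimensional embedding.
    v = np.resize(frame.reshape(-1).astype(np.float64), DIM)
    return v / (np.linalg.norm(v) + 1e-8)

def encode_text(query):
    # Placeholder for CLIP's text encoder.
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def video_embedding(frames):
    # Average-pool the per-frame embeddings, then re-normalize, so the
    # video lives in the same unit-sphere space as the text embeddings.
    embs = np.stack([encode_frame(f) for f in frames])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

frames = [rng.random((224, 224, 3)) for _ in range(8)]  # 8 sampled frames
video_vec = video_embedding(frames)
text_vec = encode_text("a dog catching a frisbee")
similarity = float(video_vec @ text_vec)  # cosine similarity in [-1, 1]
```

Because both modalities end up unit-normalized in the shared space, retrieval reduces to ranking candidates by this dot product.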

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 4 | CLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 31.2 | CLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 53.7 | CLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 64.2 | CLIP |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 5 | CLIP |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 27.2 | CLIP |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 51.7 | CLIP |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 62.6 | CLIP |
| Video Retrieval | MSR-VTT | text-to-video Median Rank | 10 | CLIP |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 21.4 | CLIP |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 41.1 | CLIP |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 50.4 | CLIP |
| Video Retrieval | MSR-VTT | video-to-text Median Rank | 2 | CLIP |
| Video Retrieval | MSR-VTT | video-to-text R@1 | 40.3 | CLIP |
| Video Retrieval | MSR-VTT | video-to-text R@5 | 69.7 | CLIP |
| Video Retrieval | MSR-VTT | video-to-text R@10 | 79.2 | CLIP |
| Video Retrieval | LSMDC | text-to-video Median Rank | 56.5 | CLIP |
| Video Retrieval | LSMDC | text-to-video R@1 | 11.3 | CLIP |
| Video Retrieval | LSMDC | text-to-video R@5 | 22.7 | CLIP |
| Video Retrieval | LSMDC | text-to-video R@10 | 29.2 | CLIP |
| Video Retrieval | LSMDC | video-to-text Median Rank | 73 | CLIP |
| Video Retrieval | LSMDC | video-to-text R@1 | 6.8 | CLIP |
| Video Retrieval | LSMDC | video-to-text R@5 | 16.4 | CLIP |
| Video Retrieval | LSMDC | video-to-text R@10 | 22.1 | CLIP |
| Video Retrieval | MSVD | text-to-video Median Rank | 3 | CLIP |
| Video Retrieval | MSVD | text-to-video R@1 | 37 | CLIP |
| Video Retrieval | MSVD | text-to-video R@5 | 64.1 | CLIP |
| Video Retrieval | MSVD | text-to-video R@10 | 73.8 | CLIP |
| Video Retrieval | MSVD | video-to-text Median Rank | 1 | CLIP |
| Video Retrieval | MSVD | video-to-text R@1 | 59.9 | CLIP |
| Video Retrieval | MSVD | video-to-text R@5 | 85.2 | CLIP |
| Video Retrieval | MSVD | video-to-text R@10 | 90.7 | CLIP |
| Image Retrieval | ConQA Conceptual | R-precision | 6.8 | CLIP |
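The metrics in the table follow the standard retrieval protocol: R@K is the percentage of queries whose correct match appears in the top K ranked candidates, and Median Rank is the median position of the correct match. A small illustration of how they are computed from a query-by-candidate similarity matrix (assuming the usual setup where entry `[i, i]` is the correct pair):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@K (as a percentage) and Median Rank, given a similarity
    matrix where sim[i, i] corresponds to the ground-truth match."""
    # Sort candidates for each query by descending similarity.
    order = np.argsort(-sim, axis=1)
    # Rank of the true candidate for each query (1 = best).
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1
                      for i in range(sim.shape[0])])
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics

# Toy example: 4 queries; the third query ranks its true match 3rd.
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.3, 0.8, 0.1, 0.2],
    [0.5, 0.4, 0.2, 0.1],
    [0.0, 0.1, 0.2, 0.7],
])
m = retrieval_metrics(sim)  # → R@1 = 75.0, MedR = 1.0
```

Lower is better for Median Rank, higher for R@K, which is why MSVD's video-to-text Median Rank of 1 is the strongest result in the table.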

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)