CLIP2TV: Align, Match and Distill for Video-Text Retrieval

Zijian Gao, Jingyu Liu, Weiqi Sun, Sheng Chen, Dedan Chang, Lili Zhao

2021-11-10 · Video Retrieval · Representation Learning · Video-Text Retrieval · Text Retrieval · Retrieval

Abstract

Modern video-text retrieval frameworks typically consist of three parts: a video encoder, a text encoder, and a similarity head. Following the success of transformers in both visual and textual representation learning, transformer-based encoders and fusion methods have also been adopted in video-text retrieval. In this report, we present CLIP2TV, which aims to explore where the critical elements lie in transformer-based methods. To achieve this, we first revisit some recent works on multi-modal learning, then introduce some techniques into video-text retrieval, and finally evaluate them through extensive experiments in different configurations. Notably, CLIP2TV achieves 52.9 R@1 on the MSR-VTT dataset, outperforming the previous SOTA result by 4.1%.
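
As a rough illustration of the three-part layout the abstract describes, the sketch below scores a batch of videos against a batch of captions with a dual encoder and cosine similarity. This is not the CLIP2TV implementation: video_encoder and text_encoder are hypothetical placeholders for any transformer backbones, and cosine similarity stands in for whatever similarity head a given method uses.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(videos, texts, video_encoder, text_encoder):
    """Generic dual-encoder retrieval scoring (illustrative, not CLIP2TV).

    video_encoder / text_encoder: placeholder backbones that map raw
    inputs to fixed-size embeddings of the same dimension.
    Returns an (n_videos, n_texts) matrix of cosine similarities.
    """
    v = F.normalize(video_encoder(videos), dim=-1)  # (n_videos, dim)
    t = F.normalize(text_encoder(texts), dim=-1)    # (n_texts, dim)
    return v @ t.T  # entry [i, j] scores video i against caption j
```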

Results

Task             Dataset      Metric                     Value  Model
Video Retrieval  MSR-VTT-1kA  text-to-video R@1          52.9   CLIP2TV
Video Retrieval  MSR-VTT-1kA  text-to-video R@5          78.5   CLIP2TV
Video Retrieval  MSR-VTT-1kA  text-to-video R@10         86.5   CLIP2TV
Video Retrieval  MSR-VTT-1kA  text-to-video Median Rank  1      CLIP2TV
Video Retrieval  MSR-VTT-1kA  text-to-video Mean Rank    12.8   CLIP2TV
Video Retrieval  MSR-VTT-1kA  video-to-text R@1          54.1   CLIP2TV
Video Retrieval  MSR-VTT-1kA  video-to-text R@5          77.4   CLIP2TV
Video Retrieval  MSR-VTT-1kA  video-to-text R@10         85.7   CLIP2TV
Video Retrieval  MSR-VTT-1kA  video-to-text Median Rank  1      CLIP2TV
Video Retrieval  MSR-VTT-1kA  video-to-text Mean Rank    9      CLIP2TV
Video Retrieval  MSR-VTT      text-to-video R@1          33.1   CLIP2TV
Video Retrieval  MSR-VTT      text-to-video R@5          58.9   CLIP2TV
Video Retrieval  MSR-VTT      text-to-video R@10         68.9   CLIP2TV
Video Retrieval  MSR-VTT      text-to-video Median Rank  3      CLIP2TV
Video Retrieval  MSR-VTT      text-to-video Mean Rank    44.7   CLIP2TV
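
The metrics above follow the standard retrieval protocol: R@K is the percentage of queries whose ground-truth match ranks in the top K (higher is better), while median and mean rank summarize where the match lands (lower is better). Below is a minimal sketch of how such metrics are typically computed from a similarity matrix, assuming the usual one-to-one pairing of queries and candidates; retrieval_metrics is an illustrative helper, not code from the paper.

```python
import numpy as np

def retrieval_metrics(sim):
    """R@1/5/10, Median Rank, and Mean Rank from a similarity matrix.

    sim[i, j] is the score of query i against candidate j; by the usual
    convention, candidate i is the ground-truth match for query i.
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # candidate indices, best first
    # 1-based position of the correct candidate in each query's ranking.
    ranks = np.where(order == np.arange(n)[:, None])[1] + 1
    return {
        "R@1":   100.0 * np.mean(ranks <= 1),
        "R@5":   100.0 * np.mean(ranks <= 5),
        "R@10":  100.0 * np.mean(ranks <= 10),
        "MedR":  float(np.median(ranks)),
        "MeanR": float(ranks.mean()),
    }
```

On the 1k-A split, sim would be a 1000x1000 text-to-video score matrix; transposing it evaluates the video-to-text direction.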

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)