
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, Rongrong Ji

2022-07-15 · Video Retrieval · Video-Text Retrieval · Text Retrieval · Contrastive Learning · Retrieval

Paper · PDF · Code (official)

Abstract

Video-text retrieval has been a crucial and fundamental task in multi-modal research. The development of video-text retrieval has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrasts, cross-grained contrast calculates the correlation between coarse-grained features and each fine-grained feature, and is able to filter out the unnecessary fine-grained features guided by the coarse-grained feature during similarity calculation, thus improving the accuracy of retrieval. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. However, another challenge lies in the similarity aggregation problem, which aims to aggregate fine-grained and cross-grained similarity matrices to instance-level similarity. To address this challenge, we propose the Attention Over Similarity Matrix (AOSM) module to make the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance on five widely used video-text retrieval datasets, including MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1) and ActivityNet (46.2 R@1). It outperforms the previous state-of-the-art by +6.3%, +6.6%, +11.1%, +6.7%, and +3.8% relative improvements on these benchmarks, demonstrating the superiority of multi-grained contrast and AOSM.
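
As a rough illustration of the ideas in the abstract (not the authors' implementation), the sketch below contrasts a coarse-grained sentence feature against fine-grained frame features and aggregates the resulting similarities with a softmax attention, in the spirit of cross-grained contrast and the AOSM module. The tensor shapes, the function name cross_grained_similarity, and the temperature tau are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def cross_grained_similarity(text_feat, frame_feats, tau=0.01):
    """Contrast a coarse-grained text feature against fine-grained frame features.

    text_feat:   (D,)    sentence-level embedding (coarse-grained)
    frame_feats: (F, D)  per-frame embeddings (fine-grained)
    Returns a single instance-level similarity score.
    """
    text_feat = F.normalize(text_feat, dim=-1)
    frame_feats = F.normalize(frame_feats, dim=-1)

    # Cross-grained similarity: one score per frame against the whole sentence.
    sims = frame_feats @ text_feat                      # (F,)

    # Attention over the similarities: frames that match the sentence well get
    # higher weight, unimportant frames are down-weighted before pooling --
    # the intuition behind the AOSM module.
    weights = torch.softmax(sims / tau, dim=0)          # (F,)
    return (weights * sims).sum()                       # scalar
```

In the paper, the same attention-over-similarity idea is also applied to the fine-grained frame-word similarity matrix, so that both unnecessary frames and unnecessary words are down-weighted before the matrices are aggregated into an instance-level score.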

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 12.2 | X-CLIP
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 2 | X-CLIP
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 49.3 | X-CLIP
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 75.8 | X-CLIP
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 84.8 | X-CLIP
Video Retrieval | MSR-VTT-1kA | video-to-text Mean Rank | 8.1 | X-CLIP
Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 2 | X-CLIP
Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 48.9 | X-CLIP
Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 76.8 | X-CLIP
Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 84.5 | X-CLIP
Video Retrieval | ActivityNet | text-to-video Mean Rank | 6.8 | X-CLIP
Video Retrieval | ActivityNet | text-to-video R@1 | 46.2 | X-CLIP
Video Retrieval | ActivityNet | text-to-video R@5 | 75.5 | X-CLIP
Video Retrieval | ActivityNet | video-to-text Mean Rank | 6.4 | X-CLIP
Video Retrieval | ActivityNet | video-to-text R@1 | 46.4 | X-CLIP
Video Retrieval | ActivityNet | video-to-text R@5 | 75.9 | X-CLIP
Video Retrieval | DiDeMo | text-to-video Mean Rank | 12.6 | X-CLIP
Video Retrieval | DiDeMo | text-to-video R@1 | 47.8 | X-CLIP
Video Retrieval | DiDeMo | text-to-video R@5 | 79.3 | X-CLIP
Video Retrieval | DiDeMo | video-to-text Mean Rank | 10.5 | X-CLIP
Video Retrieval | DiDeMo | video-to-text R@1 | 47.8 | X-CLIP
Video Retrieval | DiDeMo | video-to-text R@10 | 76.8 | X-CLIP
Video Retrieval | LSMDC | text-to-video R@1 | 26.1 | X-CLIP
Video Retrieval | LSMDC | video-to-text R@1 | 26.9 | X-CLIP
Video Retrieval | MSVD | text-to-video Mean Rank | 8.4 | X-CLIP
Video Retrieval | MSVD | text-to-video R@1 | 50.4 | X-CLIP
Video Retrieval | MSVD | text-to-video R@5 | 80.6 | X-CLIP
Video Retrieval | MSVD | video-to-text Mean Rank | 4.2 | X-CLIP
Video Retrieval | MSVD | video-to-text R@1 | 66.8 | X-CLIP
Video Retrieval | MSVD | video-to-text R@10 | 90.4 | X-CLIP
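
For context on the metrics above, R@K is the percentage of queries whose ground-truth item appears in the top K retrieved results, and Mean/Median Rank is the average/median position of the ground-truth item. The snippet below is a generic, illustrative computation of these metrics from a square text-to-video similarity matrix; it is not taken from the X-CLIP codebase.

```python
import torch

def retrieval_metrics(sim):
    """Compute R@1/R@5/R@10 and mean/median rank from a similarity matrix.

    sim: (N, N) tensor where sim[i, j] is the similarity between text i
         and video j, and the ground-truth match for text i is video i.
    """
    # Rank of the ground-truth video for each text query (1 = best).
    order = sim.argsort(dim=1, descending=True)          # (N, N)
    gt = torch.arange(sim.size(0)).unsqueeze(1)          # (N, 1)
    ranks = (order == gt).nonzero()[:, 1] + 1            # (N,)

    return {
        "R@1": (ranks <= 1).float().mean().item() * 100,
        "R@5": (ranks <= 5).float().mean().item() * 100,
        "R@10": (ranks <= 10).float().mean().item() * 100,
        "MeanR": ranks.float().mean().item(),
        "MedianR": ranks.float().median().item(),
    }

# Example: with sim = torch.eye(3), every text ranks its own video first,
# so R@1 is 100.0 and both mean and median rank are 1.0.
```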

Related Papers

- SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
- SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)