Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, Dong Shen

Published: 2021-09-09
Tasks: Video Retrieval · Video-Text Retrieval · Text Retrieval · Retrieval
Links: Paper · PDF · Code (official)

Abstract

Employing the large-scale pre-trained model CLIP for the video-text retrieval (VTR) task has become a new trend that surpasses previous VTR methods. However, due to the heterogeneity of structure and content between video and text, previous CLIP-based models are prone to overfitting during training, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with single-gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two forms of heterogeneity. CAMoE employs a Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., and then aligns them with the corresponding parts of the text. In this stage, we conduct extensive explorations of the feature extraction and feature alignment modules. DSL is proposed to avoid the one-way optimal match that occurs in previous contrastive methods. By introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser that corrects the similarity matrix and achieves the dual optimal match. DSL can be implemented in a single line of code yet yields significant improvements. The results show that the proposed CAMoE and DSL are highly effective, and each of them individually achieves state-of-the-art (SOTA) results on various benchmarks such as MSR-VTT, MSVD, and LSMDC. Combined, they advance performance by a large margin, surpassing the previous SOTA methods by around 4.6% R@1 on MSR-VTT.
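The abstract describes DSL as a one-line revision of the similarity matrix that uses the intrinsic prior of each pair in a batch to reach a dual optimal match. A minimal NumPy sketch of that idea, assuming a temperature-scaled softmax prior taken along the opposite retrieval direction (the function name and temperature value are illustrative, not the paper's exact implementation):

```python
import numpy as np

def dual_softmax_prior(sim, temperature=10.0):
    """Revise a text-video similarity matrix with a dual-softmax prior.

    sim: (N, N) array, sim[i, j] = similarity of text i with video j.
    Each entry is reweighted by a softmax over the opposite axis, so a
    pair scores highly only when each side also prefers the other
    (the "dual optimal match"). The symmetric video-to-text direction
    would swap the softmax axis.
    """
    logits = sim * temperature
    logits -= logits.max(axis=0, keepdims=True)      # numerical stability
    prior = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return sim * prior                               # the "one-line" revision

# Toy one-way match: both texts initially retrieve video 0 ...
sim = np.array([[0.90, 0.40],
                [0.85, 0.50]])
# ... but after the revision, text 1 correctly retrieves video 1,
# because video 0's softmax mass is mostly spent on text 0.
revised = dual_softmax_prior(sim)
```

The reweighting penalizes "hub" videos that attract many queries, which is one way to read the paper's claim that DSL corrects the one-way optimum-match of plain contrastive retrieval.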

Results

All results are for the CAMoE model on the Video Retrieval task.

| Dataset | Direction | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
|---|---|---|---|---|---|---|
| MSR-VTT-1kA | text-to-video | 48.8 | 75.6 | 85.3 | 2 | 12.4 |
| MSR-VTT-1kA | video-to-text | 50.3 | 74.6 | 83.8 | 2 | 9.9 |
| ActivityNet | text-to-video | 51 | 77.7 | 87.6 | 1 | 6.3 |
| DiDeMo | text-to-video | 43.8 | 71.4 | 79.9 | 2 | 16.3 |
| DiDeMo | video-to-text | 45.5 | — | 80.5 | 2 | 10.2 |
| MSR-VTT | text-to-video | 32.9 | 58.3 | 68.4 | 3 | 42.6 |
| MSR-VTT | video-to-text | 59.8 | 86.2 | 92.8 | 1 | 3.8 |
| LSMDC | text-to-video | 25.9 | 46.1 | 53.7 | — | 54.4 |
| MSVD | text-to-video | 51.8 | 87.6 | 87.6 | 1 | 8.9 |
| MSVD | video-to-text | 69.3 | 90.6 | 94.6 | 1 | 3.1 |
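All of the metrics reported above (R@K, median rank, mean rank) can be derived from a single ranking pass over a similarity matrix. A small self-contained sketch, assuming the standard evaluation convention that text i is paired with video i (the function name is illustrative):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/R@5/R@10 and median/mean rank for text-to-video retrieval.

    sim: (N, N) array where sim[i, j] scores text i against video j and
    the ground-truth match for text i is video i.
    """
    # Rank of the ground-truth video for each text (1 = retrieved first).
    order = np.argsort(-sim, axis=1)                  # best video first
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    return {
        "R@1": np.mean(ranks <= 1) * 100,
        "R@5": np.mean(ranks <= 5) * 100,
        "R@10": np.mean(ranks <= 10) * 100,
        "Median Rank": float(np.median(ranks)),
        "Mean Rank": float(np.mean(ranks)),
    }
```

Video-to-text metrics follow by transposing the matrix before the call. R@K is the percentage of queries whose ground-truth item appears in the top K, so higher is better, while lower median and mean ranks are better.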

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
- Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)