Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

Haoyu Lu, Nanyi Fei, Yuqi Huo, Yizhao Gao, Zhiwu Lu, Ji-Rong Wen

Published: 2022-04-15 · CVPR 2022
Tasks: Cross-Modal Retrieval · Image-Text Retrieval · Video Retrieval · Text Retrieval · Image-to-Text Retrieval · Text-to-Video Retrieval · Contrastive Learning · Retrieval

Abstract

Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regrettably, it suffers from low inference efficiency due to its heavy attention layers. Recently, two-stream methods like CLIP and ALIGN, which offer high inference efficiency, have also shown promising performance; however, they consider only instance-level alignment between the two streams, leaving room for improvement. To overcome these limitations, we propose a novel COllaborative Two-Stream vision-language pre-training model, termed COTS, which improves image-text retrieval by enhancing cross-modal interaction. In addition to instance-level alignment via momentum contrastive learning, COTS leverages two extra levels of cross-modal interaction: (1) Token-level interaction: a masked vision-language modeling (MVLM) learning objective is devised without using a cross-stream network module, where a variational autoencoder is imposed on the visual encoder to generate visual tokens for each image. (2) Task-level interaction: a KL-alignment learning objective is devised between the text-to-image and image-to-text retrieval tasks, where the probability distribution for each task is computed over the negative queues used in momentum contrastive learning. Under a fair comparison setting, COTS achieves the highest performance among all two-stream methods and performance comparable to the latest single-stream methods, with 10,800× faster inference. Importantly, COTS is also applicable to text-to-video retrieval, yielding a new state of the art on the widely used MSR-VTT dataset.
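A minimal sketch of the two queue-based objectives described above, assuming MoCo-style momentum negative queues; all names, shapes, and the exact form of the KL alignment are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def instance_level_loss(img_emb, txt_emb, img_queue, txt_queue, tau=0.07):
    # img_emb, txt_emb: (B, D) L2-normalized embeddings from the two streams.
    # img_queue, txt_queue: (K, D) negatives produced by momentum encoders.
    logits_i2t = img_emb @ torch.cat([txt_emb, txt_queue]).t() / tau  # (B, B+K)
    logits_t2i = txt_emb @ torch.cat([img_emb, img_queue]).t() / tau  # (B, B+K)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return (F.cross_entropy(logits_i2t, targets)
            + F.cross_entropy(logits_t2i, targets)) / 2

def task_level_kl_loss(img_emb, txt_emb, img_queue, txt_queue, tau=0.07):
    # Each retrieval direction's probability distribution over the
    # opposite modality's negative queue.
    p_i2t = F.softmax(img_emb @ txt_queue.t() / tau, dim=-1)  # (B, K)
    p_t2i = F.softmax(txt_emb @ img_queue.t() / tau, dim=-1)  # (B, K)
    # Symmetrized KL pulls the two tasks' distributions together.
    return 0.5 * (F.kl_div(p_i2t.log(), p_t2i, reduction="batchmean")
                  + F.kl_div(p_t2i.log(), p_i2t, reduction="batchmean"))

# Toy usage: batch of 8 pairs, 256-d embeddings, queues of 4096 negatives.
B, D, K = 8, 256, 4096
norm = lambda x: F.normalize(x, dim=-1)
img, txt = norm(torch.randn(B, D)), norm(torch.randn(B, D))
iq, tq = norm(torch.randn(K, D)), norm(torch.randn(K, D))
loss = instance_level_loss(img, txt, iq, tq) + task_level_kl_loss(img, txt, iq, tq)
```

The token-level MVLM objective is omitted here since it depends on the visual tokenizer; the point of the sketch is that both losses operate only on the two streams' pooled embeddings, which is what keeps inference as cheap as a dot product.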

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 36.8 | COTS
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 63.8 | COTS
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 73.2 | COTS
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 2 | COTS
Video Retrieval | MSR-VTT | text-to-video R@1 | 32.1 | COTS
Video Retrieval | MSR-VTT | text-to-video R@5 | 60.8 | COTS
Video Retrieval | MSR-VTT | text-to-video R@10 | 70.2 | COTS
Video Retrieval | MSR-VTT | text-to-video Median Rank | 3 | COTS
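For reference, R@k (the percentage of queries whose ground-truth item ranks in the top k) and Median Rank are conventionally computed from a query-by-candidate similarity matrix; a small sketch of that computation follows (the exact COTS evaluation protocol is an assumption here):

```python
import numpy as np

def retrieval_metrics(sims):
    # sims[i, j]: similarity between text query i and video j, with
    # query i's ground-truth video at column i.
    order = np.argsort(-sims, axis=1)              # candidates, best first
    gt = np.arange(sims.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1     # 1-indexed rank of ground truth
    return {
        "R@1":  float((ranks <= 1).mean() * 100),
        "R@5":  float((ranks <= 5).mean() * 100),
        "R@10": float((ranks <= 10).mean() * 100),
        "Median Rank": float(np.median(ranks)),
    }

# Toy usage mirroring the MSR-VTT-1kA setup: 1,000 queries vs. 1,000 videos.
print(retrieval_metrics(np.random.randn(1000, 1000)))
```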

Related Papers

SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)