Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval

Jinbin Bai, Chunhui Liu, Feiyue Ni, Haofan Wang, Mengying Hu, Xiaofeng Guo, Lele Cheng

2022-07-11 · Video Retrieval · Representation Learning · Video-Text Retrieval · Zero-Shot Video Retrieval · Text Retrieval · Translation · Retrieval

Paper · PDF

Abstract

Video-text retrieval is a cross-modal representation learning problem: given a text query and a pool of candidate videos, the goal is to select the video that corresponds to the query. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and unified transformer architectures, demonstrating the power of a joint latent space. Despite this, the intrinsic divergence between the visual and textual domains is still far from eliminated, and projecting different modalities into a joint latent space may distort the information within each single modality. To overcome this issue, we present a novel mechanism for learning the translation relationship from a source modality space $\mathcal{S}$ to a target modality space $\mathcal{T}$ without the need for a joint latent space, bridging the gap between the visual and textual domains. Furthermore, to keep cycle consistency between translations, we adopt a cycle loss involving both forward translations from $\mathcal{S}$ to the predicted target space $\mathcal{T'}$ and backward translations from $\mathcal{T'}$ back to $\mathcal{S}$. Extensive experiments on the MSR-VTT, MSVD, and DiDeMo datasets demonstrate the superiority and effectiveness of our LaT approach compared with vanilla state-of-the-art methods.
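The forward/backward translation with a cycle loss can be sketched numerically. This is a minimal toy illustration under stated assumptions, not the paper's implementation: the forward and backward translators are stand-in linear maps (the paper's translators are learned networks), and the loss terms are simple mean-squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Stand-in translators: forward map S -> T', backward map T' -> S.
# Using an exact matrix inverse here just makes the toy cycle loss vanish;
# in practice both maps would be trained networks.
W_f = rng.normal(size=(d, d))   # forward translation
W_g = np.linalg.inv(W_f)        # backward translation (toy inverse)

s = rng.normal(size=(2, d))     # source-modality embeddings (e.g. video)
t = s @ W_f.T                   # target-modality embeddings (toy ground truth)

t_pred = s @ W_f.T              # forward translation: S -> T'
s_back = t_pred @ W_g.T         # backward translation: T' -> S

translation_loss = np.mean((t_pred - t) ** 2)   # match the target space
cycle_loss = np.mean((s_back - s) ** 2)         # cycle consistency S -> T' -> S
total = translation_loss + cycle_loss
```

Because the backward map is the exact inverse of the forward map in this toy setup, both loss terms are (numerically) zero; with learned translators the cycle term acts as a regularizer keeping the round trip close to the identity.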

Results

Task | Dataset | Metric | Value | Model
Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 8 | LaT
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 23.4 | LaT
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 53.3 | LaT
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 44.1 | LaT
Zero-Shot Video Retrieval | MSR-VTT | video-to-text Median Rank | 12 | LaT
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 17.2 | LaT
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@10 | 47.9 | LaT
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@5 | 36.2 | LaT
Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 2 | LaT
Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 36.9 | LaT
Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 81 | LaT
Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 68.6 | LaT
Zero-Shot Video Retrieval | MSVD | video-to-text Median Rank | 3 | LaT
Zero-Shot Video Retrieval | MSVD | video-to-text R@1 | 34.4 | LaT
Zero-Shot Video Retrieval | MSVD | video-to-text R@10 | 79.2 | LaT
Zero-Shot Video Retrieval | MSVD | video-to-text R@5 | 69 | LaT
Zero-Shot Video Retrieval | DiDeMo | text-to-video Median Rank | 7 | LaT
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 22.6 | LaT
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 58.9 | LaT
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 45.9 | LaT
Zero-Shot Video Retrieval | DiDeMo | video-to-text Median Rank | 7 | LaT
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 22.5 | LaT
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 56.8 | LaT
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@5 | 45.2 | LaT
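The metrics in the table (Recall@K and Median Rank) are standard retrieval measures, not code from the paper. Given an N×N similarity matrix where entry (i, j) scores query i against candidate j and the correct match is j == i, they can be computed as follows; the function name and setup are illustrative assumptions.

```python
import numpy as np

def retrieval_metrics(sims, ks=(1, 5, 10)):
    """Recall@K (percent) and Median Rank from a query-by-candidate
    similarity matrix whose correct match lies on the diagonal."""
    n = sims.shape[0]
    order = np.argsort(-sims, axis=1)  # candidates sorted best-first per query
    # rank of the correct candidate for each query (1 = retrieved first)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    recalls = {k: 100.0 * np.mean(ranks <= k) for k in ks}
    return recalls, float(np.median(ranks))

# toy similarity matrix: the diagonal dominates, so every query ranks first
sims = np.eye(3) + 0.1 * np.random.default_rng(1).random((3, 3))
recalls, med = retrieval_metrics(sims)
```

Higher Recall@K and lower Median Rank are better, which is how the MSR-VTT, MSVD, and DiDeMo rows above should be read.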

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)