Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Video and Text Matching with Conditioned Embeddings

Ameen Ali, Idan Schwartz, Tamir Hazan, Lior Wolf

2021-10-21 · Machine Translation · Video Retrieval · Text Matching · Translation
Paper · PDF · Code (official)

Abstract

We present a method for matching a text sentence from a given corpus to a given video clip, and vice versa. Traditionally, video-text matching is done by learning a shared embedding space, and the encoding of one modality is independent of the other. In this work, we encode the dataset data in a way that takes into account the query's relevant information. The power of the method is shown to arise from pooling the interaction data between words and frames. Since the encoding of the video clip depends on the sentence it is compared to, the representation needs to be recomputed for each potential match. To this end, we propose an efficient shallow neural network. Its training employs a hierarchical triplet loss that is extendable to paragraph/video matching. The method is simple, provides explainability, and achieves state-of-the-art results for both sentence-clip and video-text matching by a sizable margin across five different datasets: ActivityNet, DiDeMo, YouCook2, MSR-VTT, and LSMDC. We also show that our conditioned representation can be transferred to video-guided machine translation, where we improve the current results on VATEX. Source code is available at https://github.com/AmeenAli/VideoMatch.
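The core idea in the abstract, scoring a (video, sentence) pair by pooling word-frame interactions so that the video encoding is conditioned on the query, can be sketched as below. This is a minimal numpy illustration, not the authors' implementation: the softmax attention-pooling form, the cosine scoring, and all function names are simplifying assumptions; see the linked repository for the actual method.

```python
import numpy as np

def conditioned_similarity(frames, words):
    """Score a (video, sentence) pair via pooled word-frame interactions.

    frames: (F, d) array of frame features; words: (W, d) array of word features.
    Illustrative simplification: each word attends over the frames (softmax of
    word-frame dot products), giving a sentence-conditioned video code per word,
    which is then compared to the word and mean-pooled into a single score.
    """
    inter = frames @ words.T                        # (F, W) word-frame interactions
    attn = np.exp(inter - inter.max())
    attn /= attn.sum(axis=0, keepdims=True)         # softmax over frames, per word
    cond_video = attn.T @ frames                    # (W, d) conditioned video codes
    num = (cond_video * words).sum(axis=1)          # per-word cosine similarity
    den = np.linalg.norm(cond_video, axis=1) * np.linalg.norm(words, axis=1) + 1e-8
    return float((num / den).mean())

def triplet_loss(s_pos, s_neg, margin=0.2):
    """Hinge-style triplet loss on pair scores (one level of the paper's
    hierarchical loss; margin value is an illustrative choice)."""
    return max(0.0, margin - s_pos + s_neg)
```

Because the video code is conditioned on the sentence, this score must be recomputed for every candidate pair, which is why the paper emphasizes keeping the network shallow.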

Results

Task             Dataset      Metric                     Value   Model
Video Retrieval  ActivityNet  text-to-video R@1          25.4    Ours
Video Retrieval  ActivityNet  text-to-video R@5          59.1    Ours
Video Retrieval  ActivityNet  video-to-text R@1          26.1    Ours
Video Retrieval  ActivityNet  video-to-text R@5          60      Ours
Video Retrieval  MSR-VTT      text-to-video Median Rank  3       Ours
Video Retrieval  MSR-VTT      text-to-video R@1          26      Ours
Video Retrieval  MSR-VTT      text-to-video R@5          56.7    Ours
Video Retrieval  MSR-VTT      video-to-text Median Rank  3       Ours
Video Retrieval  MSR-VTT      video-to-text R@1          26.7    Ours
Video Retrieval  MSR-VTT      video-to-text R@5          56.5    Ours
Video Retrieval  LSMDC        text-to-video R@1          14.9    Ours
Video Retrieval  LSMDC        text-to-video R@5          33.2    Ours
Video Retrieval  LSMDC        video-to-text R@1          15.3    Ours
Video Retrieval  LSMDC        video-to-text R@5          34.1    Ours
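The R@K and Median Rank numbers above follow the standard retrieval convention: for each query, rank all candidates by score and check where the ground-truth match lands. A minimal numpy sketch of that computation (function name and the candidate-i-matches-query-i convention are assumptions, not the paper's evaluation code):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5)):
    """Compute Recall@K (%) and Median Rank from a query-by-candidate
    similarity matrix, assuming candidate i is the true match for query i.

    sim: (N, N) array where sim[i, j] scores query i against candidate j.
    """
    order = np.argsort(-sim, axis=1)                 # best candidate first
    # rank (1-based) of the ground-truth candidate for each query
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    out = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    out["Median Rank"] = float(np.median(ranks))
    return out
```

For example, a perfect scorer (the identity similarity matrix) yields R@1 = 100 and Median Rank = 1.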

Related Papers

A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
Function-to-Style Guidance of LLMs for Code Translation (2025-07-15)
Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation (2025-07-09)
Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings (2025-07-09)
Unconditional Diffusion for Generative Sequential Recommendation (2025-07-08)
GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation (2025-07-04)
TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation (2025-07-01)
CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation (2025-06-29)