Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Satya Krishna Gorti, Noel Vouitsis, Junwei Ma, Keyvan Golestan, Maksims Volkovs, Animesh Garg, Guangwei Yu

2022-03-28 · CVPR 2022
Tasks: Video Retrieval · Video-Text Retrieval · Text to Video Retrieval · Retrieval
Links: Paper · PDF · Code (official)

Abstract

In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However, videos inherently express a much wider gamut of information than texts. Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos. Therefore, for a given text, a retrieval model should focus on the text's most semantically similar video sub-regions to make a more relevant comparison. Yet, most existing works aggregate entire videos without directly considering text. Common text-agnostic aggregation schemes include mean-pooling or self-attention over the frames, but these are likely to encode misleading visual information not described in the given text. To address this, we propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video. Our core mechanism is a scaled dot product attention for a text to attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text's attention weights over the frames. We evaluate our method on three benchmark datasets of MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results by up to 12% in relative improvement in Recall@1. Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to text. Full code and demo can be found at: https://layer6ai-labs.github.io/xpool/
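
The pooling mechanism the abstract describes reduces to a single scaled dot-product attention step in which the text embedding acts as the query and the frame embeddings act as the keys and values. Below is a minimal PyTorch sketch of that idea. It assumes pre-computed text and frame embeddings (e.g., from a CLIP-style encoder); the class name, projection layers, and shapes are illustrative and not the official implementation (see the project page linked above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedPool(nn.Module):
    """Sketch of text-conditioned frame pooling, not the official X-Pool code."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects the text query
        self.k_proj = nn.Linear(dim, dim)  # projects frame keys
        self.v_proj = nn.Linear(dim, dim)  # projects frame values

    def forward(self, text_emb: torch.Tensor, frame_embs: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, D), one embedding per query text
        # frame_embs: (B, F, D), per-frame embeddings of the paired video
        q = self.q_proj(text_emb).unsqueeze(1)               # (B, 1, D)
        k = self.k_proj(frame_embs)                          # (B, F, D)
        v = self.v_proj(frame_embs)                          # (B, F, D)
        scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5   # (B, 1, F)
        attn = scores.softmax(dim=-1)     # each frame's weight for this text
        return (attn @ v).squeeze(1)      # (B, D) text-conditioned video embedding

# Illustrative usage: score each text against its text-conditioned video embedding.
pool = TextConditionedPool(dim=512)
text = torch.randn(8, 512)
frames = torch.randn(8, 12, 512)
video = pool(text, frames)                       # (8, 512)
sims = F.cosine_similarity(text, video, dim=-1)  # (8,)
```

Because the video embedding is recomputed per text, frames irrelevant to a given query receive low attention weight and contribute little to the aggregated representation, which is the failure mode of mean-pooling that the abstract calls out.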

Results

All results are for the Video Retrieval task with the X-Pool model. Higher is better for R@K; lower is better for Median and Mean Rank.

Dataset       Direction       R@1    R@5    R@10   Median Rank   Mean Rank
MSR-VTT-1kA   text-to-video   46.9   72.8   82.2   2             14.3
MSR-VTT-1kA   video-to-text   44.4   73.3   84.0   2             9.0
LSMDC         text-to-video   25.2   43.7   53.5   8             53.2
LSMDC         video-to-text   22.7   42.6   51.2   10            47.4
MSVD          text-to-video   47.2   77.4   86.0   2             9.3
MSVD          video-to-text   66.4   90.0   94.2   1             3.3
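
The metrics above are standard retrieval statistics computed from a query-candidate similarity matrix: R@K is the percentage of queries whose ground-truth match appears in the top K ranked candidates, while Median and Mean Rank summarize the rank of the ground-truth match across queries. A minimal NumPy sketch, under the usual assumption that ground-truth pairs lie on the diagonal of the similarity matrix:

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """
    sim: (N, N) matrix where sim[i, j] scores query i against candidate j
    and the ground-truth match for query i is candidate i. Ranks are
    1-indexed, matching the convention in the table above.
    """
    order = np.argsort(-sim, axis=1)  # candidates sorted best-first per query
    # Position of the ground-truth candidate in each query's ranking.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "Median Rank": float(np.median(ranks)),
        "Mean Rank": float(np.mean(ranks)),
    }
```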
