Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-modal Transformer for Video Retrieval

Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

2020-07-21 · ECCV 2020
Tasks: Video Retrieval, Zero-Shot Video Retrieval, Retrieval, Natural Language Queries

Abstract

The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT.
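The core mechanism the abstract describes — letting features from each modality attend to the others by running self-attention over the concatenated sequence of modality tokens — can be sketched in a few lines. This is a toy single-head attention in plain Python, not the authors' implementation; the learned query/key/value projections, expert embeddings, and temporal encodings of the actual MMT model are omitted:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention over a token list.

    Because all modality tokens sit in one sequence, each token (e.g. an
    audio feature) attends to every other token (e.g. visual features),
    which is how cross-modal cues get mixed. For brevity,
    queries = keys = values = the raw tokens (no learned projections).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)  # attention distribution over all tokens
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

# Toy "video": visual and audio feature tokens concatenated into one sequence.
visual = [[1.0, 0.0], [0.9, 0.1]]
audio = [[0.0, 1.0]]
mixed = self_attention(visual + audio)  # each output mixes both modalities
```

Each output token is a convex combination of all input tokens, so visual outputs now carry audio information and vice versa; stacking such layers (with projections and feed-forward blocks) gives the multi-modal transformer encoder.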

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 24 | MMT-Pretrained |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 4 | MMT-Pretrained |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 26.6 | MMT-Pretrained |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 57.1 | MMT-Pretrained |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 69.6 | MMT-Pretrained |
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 26.7 | MMT |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 4 | MMT |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 24.6 | MMT |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 54 | MMT |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 67.1 | MMT |
| Video Retrieval | ActivityNet | text-to-video Mean Rank | 16 | MMT-Pretrained |
| Video Retrieval | ActivityNet | text-to-video Median Rank | 3.3 | MMT-Pretrained |
| Video Retrieval | ActivityNet | text-to-video R@1 | 28.7 | MMT-Pretrained |
| Video Retrieval | ActivityNet | text-to-video R@5 | 61.4 | MMT-Pretrained |
| Video Retrieval | ActivityNet | text-to-video R@50 | 94.5 | MMT-Pretrained |
| Video Retrieval | ActivityNet | text-to-video Mean Rank | 20.8 | MMT |
| Video Retrieval | ActivityNet | text-to-video Median Rank | 5 | MMT |
| Video Retrieval | ActivityNet | text-to-video R@1 | 22.7 | MMT |
| Video Retrieval | ActivityNet | text-to-video R@5 | 54.2 | MMT |
| Video Retrieval | ActivityNet | text-to-video R@50 | 93.2 | MMT |
| Video Retrieval | LSMDC | text-to-video Median Rank | 19.3 | MMT-Pretrained |
| Video Retrieval | LSMDC | text-to-video R@1 | 13.5 | MMT-Pretrained |
| Video Retrieval | LSMDC | text-to-video R@5 | 29.9 | MMT-Pretrained |
| Video Retrieval | LSMDC | text-to-video R@10 | 40.1 | MMT-Pretrained |
| Video Retrieval | LSMDC | text-to-video Median Rank | 21 | MMT |
| Video Retrieval | LSMDC | text-to-video R@1 | 13.2 | MMT |
| Video Retrieval | LSMDC | text-to-video R@5 | 29.2 | MMT |
| Video Retrieval | LSMDC | text-to-video R@10 | 38.8 | MMT |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Mean Rank | 148.1 | MMT |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 66 | MMT |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 14.4 | MMT |
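The R@K, median-rank, and mean-rank numbers above are all derived from the rank of the ground-truth video among all candidates for each text query. A minimal sketch of that computation on a toy similarity matrix (illustrative only, not the paper's evaluation code; ground-truth pairs assumed on the diagonal):

```python
def retrieval_metrics(similarity):
    """Compute text-to-video R@K, median rank, and mean rank.

    `similarity` is a query-by-candidate score matrix where entry [i][i]
    is the score of query i against its ground-truth video.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank of the correct video = 1 + number of candidates scored higher.
        rank = 1 + sum(1 for j, s in enumerate(row) if s > row[i] and j != i)
        ranks.append(rank)
    n = len(ranks)

    def recall_at(k):
        # Percentage of queries whose correct video lands in the top k.
        return 100.0 * sum(r <= k for r in ranks) / n

    srt = sorted(ranks)
    median = srt[n // 2] if n % 2 else (srt[n // 2 - 1] + srt[n // 2]) / 2
    return {
        "R@1": recall_at(1), "R@5": recall_at(5), "R@10": recall_at(10),
        "MedR": median, "MeanR": sum(ranks) / n,
    }

# Toy 3-query example: queries 0 and 2 rank their video first, query 1 second.
sims = [
    [0.9, 0.2, 0.1],
    [0.8, 0.6, 0.3],
    [0.1, 0.2, 0.7],
]
m = retrieval_metrics(sims)  # ranks are [1, 2, 1]
```

Lower is better for MedR/MeanR, higher for R@K, which is why the pretrained MMT's smaller ranks and larger recalls in the table dominate the from-scratch model.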

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
- Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)