MDMMT: Multidomain Multimodal Transformer for Video Retrieval

Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, Aleksandr Petiushko

2021-03-19 | Video Retrieval | Text to Video Retrieval | Retrieval

Abstract

We present a new state of the art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin. Moreover, the state-of-the-art results are achieved with a single model on both datasets without finetuning. This multidomain generalisation is achieved by a proper combination of different video caption datasets. We show that training on different datasets can improve each other's test results. Additionally, we check the intersection between many popular datasets and find that MSRVTT has a significant overlap between its test and train parts, and that the same situation is observed for ActivityNet.

Results

Task             Dataset      Metric                     Value  Model
Video            MSR-VTT-1kA  text-to-video Mean Rank    16.5   MDMMT
Video            MSR-VTT-1kA  text-to-video Median Rank  2      MDMMT
Video            MSR-VTT-1kA  text-to-video R@1          38.9   MDMMT
Video            MSR-VTT-1kA  text-to-video R@10         79.7   MDMMT
Video            MSR-VTT-1kA  text-to-video R@5          69     MDMMT
Video            MSR-VTT      text-to-video Mean Rank    52.8   MDMMT
Video            MSR-VTT      text-to-video Median Rank  6      MDMMT
Video            MSR-VTT      text-to-video R@1          23.1   MDMMT
Video            MSR-VTT      text-to-video R@10         61.8   MDMMT
Video            MSR-VTT      text-to-video R@5          49.8   MDMMT
Video            LSMDC        text-to-video Mean Rank    58     MDMMT
Video            LSMDC        text-to-video Median Rank  12.3   MDMMT
Video            LSMDC        text-to-video R@1          18.8   MDMMT
Video            LSMDC        text-to-video R@10         47.9   MDMMT
Video            LSMDC        text-to-video R@5          38.5   MDMMT
Video Retrieval  MSR-VTT-1kA  text-to-video Mean Rank    16.5   MDMMT
Video Retrieval  MSR-VTT-1kA  text-to-video Median Rank  2      MDMMT
Video Retrieval  MSR-VTT-1kA  text-to-video R@1          38.9   MDMMT
Video Retrieval  MSR-VTT-1kA  text-to-video R@10         79.7   MDMMT
Video Retrieval  MSR-VTT-1kA  text-to-video R@5          69     MDMMT
Video Retrieval  MSR-VTT      text-to-video Mean Rank    52.8   MDMMT
Video Retrieval  MSR-VTT      text-to-video Median Rank  6      MDMMT
Video Retrieval  MSR-VTT      text-to-video R@1          23.1   MDMMT
Video Retrieval  MSR-VTT      text-to-video R@10         61.8   MDMMT
Video Retrieval  MSR-VTT      text-to-video R@5          49.8   MDMMT
Video Retrieval  LSMDC        text-to-video Mean Rank    58     MDMMT
Video Retrieval  LSMDC        text-to-video Median Rank  12.3   MDMMT
Video Retrieval  LSMDC        text-to-video R@1          18.8   MDMMT
Video Retrieval  LSMDC        text-to-video R@10         47.9   MDMMT
Video Retrieval  LSMDC        text-to-video R@5          38.5   MDMMT
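
For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how R@K, Median Rank, and Mean Rank are commonly computed for text-to-video retrieval. It is not taken from the MDMMT code. It assumes a square similarity matrix in which the correct video for text query i sits at column i, and it resolves ties optimistically; the function name `retrieval_metrics` and the toy matrix are hypothetical, chosen only for illustration.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent), median rank, and mean rank.

    sim[i][j] is the similarity of text query i to video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank of the correct video: 1 + number of videos scored strictly higher.
        ranks.append(1 + sum(1 for s in row if s > target))

    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy example with three queries and three videos (illustrative values only).
toy_sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.2, 0.4, 0.8],  # correct video ranked 2nd
    [0.5, 0.6, 0.7],  # correct video ranked 1st
]
toy_metrics = retrieval_metrics(toy_sim, ks=(1, 2))
print(toy_metrics)
```

Higher R@K is better, while lower median and mean ranks are better. Mean rank is sensitive to a few badly ranked queries, which is why it can be much larger than the median rank in the table.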
