MDMMT: Multidomain Multimodal Transformer for Video Retrieval

Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, Aleksandr Petiushko

2021-03-19

Video Retrieval, Text to Video Retrieval, Retrieval

Abstract

We present a new state of the art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin. Moreover, these state-of-the-art results are achieved with a single model on both datasets without finetuning. This multidomain generalisation is achieved by a proper combination of different video caption datasets. We show that training on different datasets can improve each other's test results. Additionally, we check the intersection between many popular datasets and find that MSRVTT has a significant overlap between its test and train parts, and the same situation is observed for ActivityNet.

Results

Task	Dataset	Metric	Value	Model
Video Retrieval	MSR-VTT-1kA	text-to-video Mean Rank	16.5	MDMMT
Video Retrieval	MSR-VTT-1kA	text-to-video Median Rank	2	MDMMT
Video Retrieval	MSR-VTT-1kA	text-to-video R@1	38.9	MDMMT
Video Retrieval	MSR-VTT-1kA	text-to-video R@10	79.7	MDMMT
Video Retrieval	MSR-VTT-1kA	text-to-video R@5	69	MDMMT
Video Retrieval	MSR-VTT	text-to-video Mean Rank	52.8	MDMMT
Video Retrieval	MSR-VTT	text-to-video Median Rank	6	MDMMT
Video Retrieval	MSR-VTT	text-to-video R@1	23.1	MDMMT
Video Retrieval	MSR-VTT	text-to-video R@10	61.8	MDMMT
Video Retrieval	MSR-VTT	text-to-video R@5	49.8	MDMMT
Video Retrieval	LSMDC	text-to-video Mean Rank	58	MDMMT
Video Retrieval	LSMDC	text-to-video Median Rank	12.3	MDMMT
Video Retrieval	LSMDC	text-to-video R@1	18.8	MDMMT
Video Retrieval	LSMDC	text-to-video R@10	47.9	MDMMT
Video Retrieval	LSMDC	text-to-video R@5	38.5	MDMMT
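
The metrics in the table above (R@K, Median Rank, Mean Rank) are not defined on this page. The following is a minimal sketch of how such text-to-video retrieval metrics are commonly computed from a query-by-video similarity matrix. It assumes one correct video per text query (video `i` for query `i`) and resolves ties in favour of the correct video. The function names `retrieval_ranks` and `retrieval_metrics` are hypothetical, and this is not the authors' evaluation code.

```python
from statistics import mean, median


def retrieval_ranks(sim):
    """Return the 1-based rank of the correct video for each text query.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent), median rank, and mean rank."""
    ranks = retrieval_ranks(sim)
    metrics = {
        f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks
    }
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy 3x3 similarity matrix (made-up values, for illustration only).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
metrics = retrieval_metrics(sim, ks=(1, 2))
print(metrics)
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank.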