MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei Ivaniuta

2022-03-14

Video Retrieval, Text to Video Retrieval, Retrieval

Abstract

In this work we present a new state of the art on the text-to-video retrieval task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF, obtained by a single model. Three different data sources are combined: weakly supervised videos, crowd-labeled text-image pairs and text-video pairs. A careful analysis of available pre-trained networks helps to choose the ones that provide the best prior knowledge. We introduce a three-stage training procedure that transfers knowledge efficiently and allows noisy datasets to be used during training without degrading prior knowledge. Additionally, double positional encoding is used for better fusion of different modalities, and a simple method for processing non-square inputs is suggested.

Results

Task	Dataset	Metric	Value	Model
Video	YouCook2	text-to-video Mean Rank	12.7	MDMMT-2
Video	YouCook2	text-to-video Median Rank	3	MDMMT-2
Video	YouCook2	text-to-video R@1	32	MDMMT-2
Video	YouCook2	text-to-video R@10	74.8	MDMMT-2
Video	YouCook2	text-to-video R@5	64	MDMMT-2
Video	MSR-VTT	text-to-video Mean Rank	37.8	MDMMT-2
Video	MSR-VTT	text-to-video Median Rank	3	MDMMT-2
Video	MSR-VTT	text-to-video R@1	33.7	MDMMT-2
Video	MSR-VTT	text-to-video R@10	70.8	MDMMT-2
Video	MSR-VTT	text-to-video R@5	60.5	MDMMT-2
Video	LSMDC	text-to-video Mean Rank	48	MDMMT-2
Video	LSMDC	text-to-video Median Rank	6.7	MDMMT-2
Video	LSMDC	text-to-video R@1	26.9	MDMMT-2
Video	LSMDC	text-to-video R@10	55.9	MDMMT-2
Video	LSMDC	text-to-video R@5	46.7	MDMMT-2
Video	TGIF	text-to-video Mean Rank	94.1	MDMMT-2
Video	TGIF	text-to-video Median Rank	7	MDMMT-2
Video	TGIF	text-to-video R@1	25.5	MDMMT-2
Video	TGIF	text-to-video R@10	55.7	MDMMT-2
Video	TGIF	text-to-video R@5	46.1	MDMMT-2
Video	MSVD	text-to-video Mean Rank	8.8	MDMMT-2
Video	MSVD	text-to-video Median Rank	1	MDMMT-2
Video	MSVD	text-to-video R@1	56.8	MDMMT-2
Video	MSVD	text-to-video R@10	89.2	MDMMT-2
Video	MSVD	text-to-video R@5	83.1	MDMMT-2
Video Retrieval	YouCook2	text-to-video Mean Rank	12.7	MDMMT-2
Video Retrieval	YouCook2	text-to-video Median Rank	3	MDMMT-2
Video Retrieval	YouCook2	text-to-video R@1	32	MDMMT-2
Video Retrieval	YouCook2	text-to-video R@10	74.8	MDMMT-2
Video Retrieval	YouCook2	text-to-video R@5	64	MDMMT-2
Video Retrieval	MSR-VTT	text-to-video Mean Rank	37.8	MDMMT-2
Video Retrieval	MSR-VTT	text-to-video Median Rank	3	MDMMT-2
Video Retrieval	MSR-VTT	text-to-video R@1	33.7	MDMMT-2
Video Retrieval	MSR-VTT	text-to-video R@10	70.8	MDMMT-2
Video Retrieval	MSR-VTT	text-to-video R@5	60.5	MDMMT-2
Video Retrieval	LSMDC	text-to-video Mean Rank	48	MDMMT-2
Video Retrieval	LSMDC	text-to-video Median Rank	6.7	MDMMT-2
Video Retrieval	LSMDC	text-to-video R@1	26.9	MDMMT-2
Video Retrieval	LSMDC	text-to-video R@10	55.9	MDMMT-2
Video Retrieval	LSMDC	text-to-video R@5	46.7	MDMMT-2
Video Retrieval	TGIF	text-to-video Mean Rank	94.1	MDMMT-2
Video Retrieval	TGIF	text-to-video Median Rank	7	MDMMT-2
Video Retrieval	TGIF	text-to-video R@1	25.5	MDMMT-2
Video Retrieval	TGIF	text-to-video R@10	55.7	MDMMT-2
Video Retrieval	TGIF	text-to-video R@5	46.1	MDMMT-2
Video Retrieval	MSVD	text-to-video Mean Rank	8.8	MDMMT-2
Video Retrieval	MSVD	text-to-video Median Rank	1	MDMMT-2
Video Retrieval	MSVD	text-to-video R@1	56.8	MDMMT-2
Video Retrieval	MSVD	text-to-video R@10	89.2	MDMMT-2
Video Retrieval	MSVD	text-to-video R@5	83.1	MDMMT-2
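
The metrics in the table are standard ranking measures for text-to-video retrieval: R@K is the percentage of text queries whose correct video appears in the top K results (higher is better), while Median Rank and Mean Rank summarize the position of the correct video (lower is better). The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. It assumes a square similarity matrix in which query i has exactly one correct video, at index i; the function names `retrieval_ranks` and `retrieval_metrics` and the toy scores are illustrative only.

```python
from statistics import mean, median


def retrieval_ranks(similarity):
    """Return the rank (1 = best) of the correct video for each text query.

    similarity[i][j] is the score between query i and video j.
    Assumption: video i is the single ground-truth match for query i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """Compute R@K (in percent), Median Rank and Mean Rank."""
    ranks = retrieval_ranks(similarity)
    metrics = {
        f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks
    }
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy example with made-up scores: three queries, three videos.
toy = [
    [0.9, 0.1, 0.3],  # query 0: correct video is ranked first
    [0.8, 0.5, 0.2],  # query 1: correct video is ranked second
    [0.7, 0.6, 0.4],  # query 2: correct video is ranked third
]
print(retrieval_metrics(toy, ks=(1, 2)))
```

Ties are resolved optimistically here, because only strictly higher scores push the correct video down. Evaluation scripts differ on this point, so treat the tie rule as an assumption too.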
