Jinbin Bai, Chunhui Liu, Feiyue Ni, Haofan Wang, Mengying Hu, Xiaofeng Guo, Lele Cheng
Video-text retrieval is a cross-modal representation learning problem whose goal is to select, from a pool of candidate videos, the video that corresponds to a given text query. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and unified transformer architectures, and has demonstrated the power of a joint latent space. Despite this, the intrinsic divergence between the visual and textual domains is still far from eliminated, and projecting different modalities into a joint latent space may distort the information within each single modality. To overcome this issue, we present a novel mechanism for learning the translation relationship from a source modality space $\mathcal{S}$ to a target modality space $\mathcal{T}$ without the need for a joint latent space, which bridges the gap between the visual and textual domains. Furthermore, to keep cycle consistency between translations, we adopt a cycle loss involving both a forward translation from $\mathcal{S}$ to the predicted target space $\mathcal{T'}$ and a backward translation from $\mathcal{T'}$ back to $\mathcal{S}$. Extensive experiments on the MSR-VTT, MSVD, and DiDeMo datasets demonstrate the superiority and effectiveness of our LaT approach compared with vanilla state-of-the-art methods.
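The cycle-consistency objective described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the translators `translate`, the weight matrices `W_st`/`W_ts`, and the linear form of the mappings are all stand-in assumptions (in LaT the translators would be learned networks over video and text embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)

def translate(x, W):
    """Linear translation between modality spaces (stand-in for a learned network)."""
    return x @ W

def cycle_loss(s, W_st, W_ts):
    """Mean squared reconstruction error after forward (S -> T') and backward (T' -> S) translation."""
    t_pred = translate(s, W_st)       # forward: S -> T'
    s_back = translate(t_pred, W_ts)  # backward: T' -> S
    return float(np.mean((s - s_back) ** 2))

d = 4
s = rng.normal(size=(8, d))           # a batch of source-modality embeddings
W_st = rng.normal(size=(d, d))        # forward translator
W_ts = np.linalg.inv(W_st)            # a perfect backward translator

# With an exact inverse, the cycle loss vanishes (up to float error).
print(round(cycle_loss(s, W_st, W_ts), 6))  # → 0.0
```

The point of the loss is that minimizing it pushes the backward translator toward inverting the forward one, so information in the source modality survives the round trip rather than being distorted by projection into a shared space.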
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 23.4 | LaT |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 44.1 | LaT |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 53.3 | LaT |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 8 | LaT |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 17.2 | LaT |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@5 | 36.2 | LaT |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@10 | 47.9 | LaT |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text Median Rank | 12 | LaT |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 36.9 | LaT |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 68.6 | LaT |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 81.0 | LaT |
| Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 2 | LaT |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@1 | 34.4 | LaT |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@5 | 69.0 | LaT |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@10 | 79.2 | LaT |
| Zero-Shot Video Retrieval | MSVD | video-to-text Median Rank | 3 | LaT |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 22.6 | LaT |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 45.9 | LaT |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 58.9 | LaT |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video Median Rank | 7 | LaT |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 22.5 | LaT |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@5 | 45.2 | LaT |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 56.8 | LaT |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text Median Rank | 7 | LaT |