Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, Xiaohu Qie, Ping Luo

2022-04-26
Tasks: Video Retrieval, Video-Text Retrieval, Zero-Shot Video Retrieval, Text Retrieval, Text to Video Retrieval, Zero-Shot Action Recognition, Action Recognition, Retrieval, Video to Text Retrieval
Paper · PDF · Code (official)

Abstract

Dominant pre-training work for video-text retrieval mainly adopts the "dual-encoder" architecture to enable efficient retrieval, where two separate encoders contrast global video and text representations but ignore detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to address this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the "dual-encoder" architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving "tokenizer" to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover the text-aligned features of the masked patches by reasoning over the visible regions along the spatial and temporal dimensions, which enhances both the discriminativeness of local visual features and the fine-grained cross-modality alignment. Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols. Our approach also surpasses the baseline models significantly on zero-shot action recognition, which can be cast as video-to-text retrieval.
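
As a rough illustration of the mechanism the abstract describes, the sketch below shows masked visual modeling with a snapshot (EMA) encoder acting as an evolving tokenizer, in PyTorch. Everything here is an assumption for illustration: the module names and sizes, the zero-masking scheme, and the cosine reconstruction loss are not taken from the official code, and the language-semantics injection (aligning the snapshot encoder's outputs with text) is omitted.

```python
# Minimal sketch of snapshot-encoder masked visual modeling (not the authors' code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVideoEncoder(nn.Module):
    """Stand-in transformer over (batch, patches, dim) tokens; sizes are made up."""
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):
        return self.blocks(tokens)

@torch.no_grad()
def ema_update(snapshot, online, momentum=0.99):
    # The snapshot "tokenizer" slowly trails the online encoder.
    for ps, po in zip(snapshot.parameters(), online.parameters()):
        ps.mul_(momentum).add_(po, alpha=1.0 - momentum)

def masked_modeling_loss(online, snapshot, patches, mask_ratio=0.5):
    B, N, D = patches.shape
    n_mask = int(N * mask_ratio)
    idx = torch.rand(B, N).argsort(dim=1)              # random patch order
    masked = idx[:, :n_mask]                           # indices to corrupt
    corrupted = patches.clone()
    corrupted.scatter_(1, masked.unsqueeze(-1).expand(-1, -1, D),
                       torch.zeros(B, n_mask, D))      # simple zero-masking

    with torch.no_grad():                              # targets from the full video
        targets = snapshot(patches)
    preds = online(corrupted)                          # predict from visible context

    gather = masked.unsqueeze(-1).expand(-1, -1, D)
    # Cosine-style loss on masked positions only (one plausible choice).
    p = F.normalize(preds.gather(1, gather), dim=-1)
    t = F.normalize(targets.gather(1, gather), dim=-1)
    return (2 - 2 * (p * t).sum(-1)).mean()

online = TinyVideoEncoder()
snapshot = copy.deepcopy(online)
for p in snapshot.parameters():
    p.requires_grad_(False)

patches = torch.randn(2, 32, 256)                      # (batch, patches, dim)
loss = masked_modeling_loss(online, snapshot, patches)
loss.backward()
ema_update(snapshot, online)
```

The key design point, per the abstract, is that the reconstruction targets evolve with training: because the snapshot trails the online encoder, the targets stay aligned with the improving (text-aligned) representation rather than being fixed up front.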

Results

Task                      | Dataset | Metric                    | Value | Model
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1         | 26.1  | MILES
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5         | 47.2  | MILES
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10        | 56.9  | MILES
Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 7     | MILES
Zero-Shot Video Retrieval | MSVD    | text-to-video R@1         | 44.4  | MILES
Zero-Shot Video Retrieval | MSVD    | text-to-video R@5         | 76.2  | MILES
Zero-Shot Video Retrieval | MSVD    | text-to-video R@10        | 87    | MILES
Zero-Shot Video Retrieval | MSVD    | text-to-video Median Rank | 2     | MILES
Zero-Shot Video Retrieval | DiDeMo  | text-to-video R@1         | 27.2  | MILES
Zero-Shot Video Retrieval | DiDeMo  | text-to-video R@5         | 50.3  | MILES
Zero-Shot Video Retrieval | DiDeMo  | text-to-video R@10        | 63.6  | MILES
Zero-Shot Video Retrieval | DiDeMo  | text-to-video Median Rank | 5     | MILES
Zero-Shot Video Retrieval | LSMDC   | text-to-video R@1         | 11.1  | MILES
Zero-Shot Video Retrieval | LSMDC   | text-to-video R@5         | 24.7  | MILES
Zero-Shot Video Retrieval | LSMDC   | text-to-video R@10        | 30.6  | MILES
Zero-Shot Video Retrieval | LSMDC   | text-to-video Median Rank | 50.7  | MILES
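
For reference, R@K and Median Rank above are the standard retrieval metrics computed from a text-video similarity matrix: R@K is the percentage of text queries whose ground-truth video appears in the top K ranked results, and Median Rank is the median position of the ground-truth video. A minimal sketch, assuming query i's correct video is index i:

```python
import torch

def retrieval_metrics(sim):
    """sim: (Q, V) text-to-video similarities; query i's ground truth is video i."""
    order = sim.argsort(dim=1, descending=True)        # videos ranked per query
    gt = torch.arange(sim.size(0)).unsqueeze(1)
    ranks = (order == gt).int().argmax(dim=1) + 1      # 1-based rank of the match
    metrics = {f"R@{k}": 100.0 * (ranks <= k).float().mean().item() for k in (1, 5, 10)}
    metrics["Median Rank"] = ranks.float().median().item()
    return metrics

# Toy example: random scores over a 1000x1000 gallery (e.g. the size of MSR-VTT's 1k test split).
print(retrieval_metrics(torch.randn(1000, 1000)))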

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)