
Bridging Video-text Retrieval with Multiple Choice Questions

Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo

Published: 2022-01-13 · CVPR 2022
Tasks: Video Retrieval, Video-Text Retrieval, Zero-Shot Video Retrieval, Text Retrieval, Text-to-video search, Text to Video Retrieval, Zero-Shot Action Recognition, Action Recognition, Retrieval, Multiple-choice, Linear evaluation, Video to Text Retrieval
Links: Paper · PDF · Code (official) · Code

Abstract

Pre-training a model to learn transferable video-text representations for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to model video-text interactions, but this results in low efficiency since every text-video pair must be fed through the model. In this work, we enable fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), where a parametric module, BridgeFormer, is trained to answer the "questions" constructed from the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. In the form of questions and answers, the semantic associations between local video-text features can be properly established. BridgeFormer can be removed for downstream retrieval, rendering an efficient and flexible model with only two encoders. Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets with different experimental setups (i.e., zero-shot and fine-tuning), including HowTo100M (one million videos). We further conduct zero-shot action recognition, which can be cast as video-to-text retrieval, and our approach also significantly surpasses its counterparts. As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
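
To make the pretext task concrete, here is a minimal sketch of the MCQ idea. Everything below is an illustrative assumption rather than the authors' implementation: module names, dimensions, and the InfoNCE-style objective are stand-ins. A caption with a noun or verb erased plays the role of the "question"; a cross-attention module queries the video tokens, and its pooled output (the "answer") is trained to match an embedding of the erased phrase.

```python
# Illustrative sketch of the MCQ pretext task (shapes, names, and the
# loss are assumptions for exposition, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgeSketch(nn.Module):
    """Text 'question' tokens cross-attend to video tokens to form an 'answer'."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, question_tokens, video_tokens):
        # The question (caption with a noun/verb erased) queries the video.
        answer, _ = self.attn(question_tokens, video_tokens, video_tokens)
        return self.proj(answer.mean(dim=1))      # pooled "answer" embedding

def mcq_loss(answers, erased_phrase_emb, temperature=0.07):
    """InfoNCE-style objective: each answer should match its own erased phrase."""
    a = F.normalize(answers, dim=-1)
    t = F.normalize(erased_phrase_emb, dim=-1)
    logits = a @ t.t() / temperature
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)

# Toy batch: 8 clips, 12 question tokens, 32 video tokens, width 256.
questions = torch.randn(8, 12, 256)
videos = torch.randn(8, 32, 256)
phrases = torch.randn(8, 256)
print(mcq_loss(BridgeSketch()(questions, videos), phrases).item())
```

Because the bridge module only connects the two encoders during pre-training, it can be discarded afterwards, which is why retrieval keeps the dual-encoder cost of one dot product per text-video pair.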

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 3 | BridgeFormer
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 37.6 | BridgeFormer
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 75.1 | BridgeFormer
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 64.8 | BridgeFormer
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 7 | BridgeFormer (Zero-shot)
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 26 | BridgeFormer (Zero-shot)
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 56.4 | BridgeFormer (Zero-shot)
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 46.4 | BridgeFormer (Zero-shot)
Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 7 | Y. Ge et al.
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 26 | Y. Ge et al.
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 56.4 | Y. Ge et al.
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 46.4 | Y. Ge et al.
Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 2 | Y. Ge et al.
Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 43.6 | Y. Ge et al.
Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 84.9 | Y. Ge et al.
Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 74.9 | Y. Ge et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video Median Rank | 5 | Y. Ge et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 25.6 | Y. Ge et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 61.1 | Y. Ge et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 50.6 | Y. Ge et al.
Zero-Shot Video Retrieval | LSMDC | text-to-video Median Rank | 42 | Y. Ge et al.
Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 12.2 | Y. Ge et al.
Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 32.2 | Y. Ge et al.
Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 25.9 | Y. Ge et al.
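
The recall and median-rank figures above follow the standard text-to-video retrieval protocol: rank every candidate video for each caption by similarity and record where the ground-truth video lands. A minimal NumPy sketch of that evaluation (the random similarity matrix is purely illustrative):

```python
# Standard text-to-video retrieval metrics from a caption-by-video
# similarity matrix. Row i is caption i; the ground-truth video for
# caption i is assumed to sit in column i (the diagonal).
import numpy as np

def retrieval_metrics(sim):
    order = np.argsort(-sim, axis=1)              # videos by descending similarity
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1    # rank of the correct video (1 = best)
    return {
        "R@1": 100 * np.mean(ranks <= 1),
        "R@5": 100 * np.mean(ranks <= 5),
        "R@10": 100 * np.mean(ranks <= 10),
        "Median Rank": float(np.median(ranks)),
    }

# Illustrative only: random scores over a 1,000-pair split (the size of MSR-VTT 1k-A).
sim = np.random.randn(1000, 1000)
print(retrieval_metrics(sim))
```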
