Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu

Published: 2022-09-04 · CVPR 2023

Tasks: Question Answering, Fill Mask, Video Retrieval, Optical Flow Estimation, Text to Video Retrieval, Video Question Answering, Video Captioning, TGIF-Transition, Retrieval, Visual Question Answering (VQA), TGIF-Action, TGIF-Frame

Links: Paper · PDF · Code (official)

Abstract

Masked visual modeling (MVM) has recently been proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that can largely benefit downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model, VIOLETv2. Empirically, we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.
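As a rough illustration of the kind of objective the abstract describes, the sketch below shows a generic masked-visual-modeling loss: a fraction of video patches is masked, a small head predicts a reconstructive target (here, raw pixel values) for each patch, and the regression loss is taken only over the masked positions. All class, function, and tensor names are illustrative assumptions, not the paper's actual VIOLETv2 implementation.

# Minimal sketch of a masked visual modeling (MVM) objective, assuming a generic
# patch-level video encoder. Names are illustrative, not the paper's code.
import torch
import torch.nn as nn

class MVMHead(nn.Module):
    """Predicts a per-patch reconstruction target (e.g., raw pixels) from encoder features."""
    def __init__(self, hidden_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, target_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)

def mvm_loss(features: torch.Tensor,   # (B, N, D) encoder outputs for N video patches
             targets: torch.Tensor,    # (B, N, T) reconstruction target per patch
             mask: torch.Tensor,       # (B, N) bool, True where the patch was masked
             head: MVMHead) -> torch.Tensor:
    """Regression loss computed only on masked patches."""
    preds = head(features)                              # (B, N, T)
    per_patch = (preds - targets).pow(2).mean(dim=-1)   # (B, N) MSE per patch
    masked = per_patch[mask]                            # keep only masked positions
    return masked.mean() if masked.numel() > 0 else per_patch.sum() * 0.0

# Toy usage: 2 clips, 16 patches each, 256-dim features, pixel targets of size 768.
features = torch.randn(2, 16, 256)
targets = torch.randn(2, 16, 768)
mask = torch.rand(2, 16) < 0.15                         # ~15% of patches masked
head = MVMHead(256, 768)
loss = mvm_loss(features, targets, mask, head)

Swapping the target tensor (pixel values, oriented gradients, depth maps, optical flow, discrete visual tokens, latent visual features) is what distinguishes the eight reconstructive targets the paper compares; the mask-and-regress structure stays the same.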

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | DiDeMo | text-to-video R@1 | 47.9 | VIOLETv2
Video Retrieval | DiDeMo | text-to-video R@5 | 76.5 | VIOLETv2
Video Retrieval | DiDeMo | text-to-video R@10 | 84.1 | VIOLETv2
Video Retrieval | MSR-VTT | text-to-video R@1 | 37.2 | VIOLETv2
Video Retrieval | MSR-VTT | text-to-video R@5 | 64.8 | VIOLETv2
Video Retrieval | MSR-VTT | text-to-video R@10 | 75.8 | VIOLETv2
Video Retrieval | LSMDC | text-to-video R@1 | 24.0 | VIOLETv2
Video Retrieval | LSMDC | text-to-video R@5 | 43.5 | VIOLETv2
Video Retrieval | LSMDC | text-to-video R@10 | 54.1 | VIOLETv2
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.547 | VIOLETv2
Video Question Answering | MSRVTT-QA | Accuracy | 44.5 | VIOLETv2
Video Question Answering | LSMDC-MC | Accuracy | 84.4 | VIOLETv2
Video Question Answering | MSRVTT-MC | Accuracy | 97.6 | VIOLETv2
Video Captioning | MSR-VTT | CIDEr | 58.0 | VIOLETv2
Video Captioning | MSVD | CIDEr | 139.2 | VIOLETv2

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)