Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer

2021-05-20 · Findings (ACL) 2021

Tasks: Action Segmentation · Video Retrieval · Video Captioning · Video Understanding · Retrieval · Temporal Action Localization · Language Modelling

Paper · PDF · Code (official)

Abstract

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g., by forcing masked text tokens to predict the closest video embeddings) while also maintaining separability (e.g., unimodal predictions are sometimes required, without using all of the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
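To make the cross-modal masking idea concrete, here is a minimal sketch of a "masked text token predicts its closest video clip embedding" objective. This is not the paper's implementation (the official code is in fairseq's MMPT example); the function name, the `align_idx` input (assumed to come from temporal alignment between tokens and clips), the mask ratio, and the temperature are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_text_to_video_loss(text_emb, video_emb, align_idx,
                              mask_ratio=0.15, temperature=0.07):
    """Hypothetical sketch of a cross-modal masking objective.

    text_emb:  (T, d) contextual text-token embeddings from the encoder
    video_emb: (V, d) clip embeddings from the paired video
    align_idx: (T,)  index of the temporally closest clip for each token
    """
    T = text_emb.shape[0]
    # Pick a random subset of text positions to treat as masked.
    mask = torch.rand(T) < mask_ratio
    if not mask.any():                      # guarantee at least one position
        mask[torch.randint(T, (1,))] = True
    pred = F.normalize(text_emb[mask], dim=-1)     # (M, d)
    clips = F.normalize(video_emb, dim=-1)         # (V, d)
    # Each masked token must score its aligned ("closest") clip highest
    # among all clips of the video: a softmax over clip similarities.
    logits = pred @ clips.t() / temperature        # (M, V)
    return F.cross_entropy(logits, align_idx[mask])

# Toy usage with random features and random alignments:
T, V, d = 12, 6, 256
loss = masked_text_to_video_loss(
    torch.randn(T, d), torch.randn(V, d), torch.randint(V, (T,)))
```

Because the loss only ever consumes one modality's tokens as queries, the same encoder can also be trained with purely unimodal targets, which is what preserves the separability the abstract mentions.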

Results

Task | Dataset | Metric | Value | Model
Video | CrossTask | Recall | 46.5 | VLM
Video | MSR-VTT-1kA | text-to-video Median Rank | 4 | VLM
Video | MSR-VTT-1kA | text-to-video R@1 | 28.1 | VLM
Video | MSR-VTT-1kA | text-to-video R@10 | 67.4 | VLM
Video | MSR-VTT-1kA | text-to-video R@5 | 55.5 | VLM
Video | YouCook2 | text-to-video Median Rank | 4 | VLM
Video | YouCook2 | text-to-video R@1 | 27.05 | VLM
Video | YouCook2 | text-to-video R@10 | 69.38 | VLM
Video | YouCook2 | text-to-video R@5 | 56.88 | VLM
Temporal Action Localization | CrossTask | Recall | 46.5 | VLM
Zero-Shot Learning | CrossTask | Recall | 46.5 | VLM
Action Localization | CrossTask | Recall | 46.5 | VLM
Action Localization | COIN | Frame accuracy | 68.4 | VLM
Video Captioning | YouCook2 | BLEU-3 | 17.78 | VLM
Video Captioning | YouCook2 | BLEU-4 | 12.27 | VLM
Video Captioning | YouCook2 | CIDEr | 1.3869 | VLM
Video Captioning | YouCook2 | METEOR | 18.22 | VLM
Video Captioning | YouCook2 | ROUGE-L | 41.51 | VLM
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 4 | VLM
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 28.1 | VLM
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 67.4 | VLM
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 55.5 | VLM
Video Retrieval | YouCook2 | text-to-video Median Rank | 4 | VLM
Video Retrieval | YouCook2 | text-to-video R@1 | 27.05 | VLM
Video Retrieval | YouCook2 | text-to-video R@10 | 69.38 | VLM
Video Retrieval | YouCook2 | text-to-video R@5 | 56.88 | VLM
Action Segmentation | COIN | Frame accuracy | 68.4 | VLM
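For reference, text-to-video R@k is the percentage of text queries whose ground-truth video appears in the top k of the similarity-ranked candidates, and Median Rank is the median position of the ground-truth video (lower is better). The helper below is a self-contained sketch of these standard metrics, not code from the paper or the archive; it assumes the usual evaluation protocol in which query i's ground-truth video is candidate i.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim: (Q, N) similarity matrix; sim[i, j] is the score of video j
    for text query i, and video i is the ground truth for query i."""
    Q = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # candidates sorted by descending score
    # Rank of the ground-truth video for each query (1 = best).
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(Q)])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    return metrics

# Toy check: a diagonal-dominant similarity matrix should score highly.
sim = np.random.randn(100, 100) + 3.0 * np.eye(100)
print(retrieval_metrics(sim))
```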

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)