Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou

2020-02-15 · Action Segmentation · Video Retrieval · Video Captioning · Language Modelling

Abstract

With the recent success of pre-training techniques for NLP and image-linguistic tasks, video-linguistic pre-training works have gradually been developed to improve video-text related downstream tasks. However, most existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components, two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone. Five objectives are designed to train these components: video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction. We further develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective. Pre-training is carried out on the sizeable instructional video dataset HowTo100M. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
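The CMLM objective mentioned in the abstract masks text tokens and asks the model to reconstruct them conditioned on the paired video. A minimal sketch of the masking step is below; the 15% mask rate and the `mask_tokens` helper are conventional BERT-style assumptions for illustration, not the paper's exact recipe, and the video conditioning happens inside the model, which is not shown.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of text tokens with [MASK].
    The model is trained to reconstruct the masked tokens (here,
    conditioned on paired video features, which are not shown).
    mask_prob=0.15 is an assumed BERT-style rate."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # reconstruction target at this position
        else:
            masked.append(tok)
            labels.append(None)  # no loss on unmasked positions
    return masked, labels

# Toy caption from an instructional-video setting (hypothetical example).
masked, labels = mask_tokens("slice the tomato thinly".split(), mask_prob=0.5, seed=1)
```

The loss is computed only at masked positions (`labels[i] is not None`), which is what distinguishes a masked-reconstruction objective from plain language modelling.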

Results

| Task                | Dataset  | Metric                     | Value | Model |
|---------------------|----------|----------------------------|-------|-------|
| Video Retrieval     | YouCook2 | text-to-video R@1          | 28.9  | UniVL |
| Video Retrieval     | YouCook2 | text-to-video R@5          | 57.6  | UniVL |
| Video Retrieval     | YouCook2 | text-to-video R@10         | 70    | UniVL |
| Video Retrieval     | YouCook2 | text-to-video Median Rank  | 4     | UniVL |
| Video Retrieval     | MSR-VTT  | text-to-video R@1          | 21.2  | UniVL |
| Video Retrieval     | MSR-VTT  | text-to-video R@5          | 49.6  | UniVL |
| Video Retrieval     | MSR-VTT  | text-to-video R@10         | 63.1  | UniVL |
| Video Retrieval     | MSR-VTT  | text-to-video Median Rank  | 6     | UniVL |
| Video Captioning    | YouCook2 | BLEU-3                     | 23.87 | UniVL |
| Video Captioning    | YouCook2 | BLEU-4                     | 17.35 | UniVL |
| Video Captioning    | YouCook2 | METEOR                     | 22.35 | UniVL |
| Video Captioning    | YouCook2 | ROUGE-L                    | 46.52 | UniVL |
| Video Captioning    | YouCook2 | CIDEr                      | 1.81  | UniVL |
| Action Segmentation | COIN     | Frame accuracy             | 70    | UniVL |
| Action Localization | COIN     | Frame accuracy             | 70    | UniVL |
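The retrieval numbers above are Recall@K (percentage of queries whose ground-truth video appears in the top K results) and Median Rank (median position of the ground-truth video, lower is better). These are standard metrics computable from a caption-by-video similarity matrix; a stdlib-only sketch, assuming the usual diagonal setup where caption i matches video i:

```python
def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute text-to-video Recall@K (%) and Median Rank from a
    similarity matrix sim[i][j] = score(caption_i, video_j),
    assuming caption i's ground-truth match is video i."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scored strictly above the true one.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    srt = sorted(ranks)
    n = len(srt)
    metrics["MedR"] = float(srt[n // 2] if n % 2 else (srt[n // 2 - 1] + srt[n // 2]) / 2)
    return metrics

# Toy example: 4 captions scored against 4 videos.
sim = [
    [0.9, 0.1, 0.2, 0.0],  # caption 0: true video ranked 1st
    [0.3, 0.8, 0.1, 0.2],  # caption 1: true video ranked 1st
    [0.7, 0.2, 0.4, 0.1],  # caption 2: true video ranked 2nd
    [0.1, 0.0, 0.2, 0.6],  # caption 3: true video ranked 1st
]
print(retrieval_metrics(sim))  # R@1 = 75.0, R@5 = 100.0, R@10 = 100.0, MedR = 1.0
```

Breaking ties by counting only strictly higher scores is one common convention; evaluation scripts differ on tie handling, so exact reproductions should follow the paper's released code.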

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing (2025-07-16)