Papers With Code 2



Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, Jiashi Feng

2023-05-22 · Question Answering · Video Retrieval · Video-Text Retrieval · Text Retrieval · Video Question Answering · Video Captioning · Retrieval · Visual Question Answering (VQA) · TGIF-Frame

Abstract

Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations. However, there is limited research on learning video-text representations for general video multimodal tasks based on these powerful features. Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video text retrieval, video captioning, and video question answering. Remarkably, VLAB outperforms competing methods significantly and sets new records in video question answering on MSRVTT, MSVD, and TGIF datasets. It achieves an accuracy of 49.6, 61.0, and 79.0, respectively. Codes and models will be released.
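The abstract names two strategies: feature adapting (a video adapter module that gives CLIP's frame-level features temporal modeling) and feature blending (end-to-end training that exploits the complementarity of image and video features). No implementation is shown on this page, so the PyTorch sketch below only illustrates the general idea of a temporal adapter applied to frozen per-frame CLIP embeddings; the class name, dimensions, and architecture are assumptions for illustration, not the authors' actual module.

```python
# Hypothetical sketch of a temporal adapter over frozen per-frame CLIP embeddings.
# Names, dimensions, and architecture are illustrative assumptions, not VLAB's module.
import torch
import torch.nn as nn


class TemporalAdapter(nn.Module):
    """Adds lightweight temporal mixing on top of per-frame CLIP features."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) -- one CLIP embedding per sampled frame.
        attn_out, _ = self.temporal_attn(frame_feats, frame_feats, frame_feats)
        fused = self.norm(frame_feats + attn_out)  # residual temporal mixing across frames
        return fused.mean(dim=1)                   # pooled video-level representation


# Usage: pool 8 frame embeddings of width 768 into one video embedding per clip.
adapter = TemporalAdapter(dim=768)
video_embedding = adapter(torch.randn(2, 8, 768))  # -> shape (2, 768)
```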

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.496 | VLAB
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.61 | VLAB
Video Captioning | MSR-VTT | BLEU-4 | 54.6 | VLAB
Video Captioning | MSR-VTT | CIDEr | 74.9 | VLAB
Video Captioning | MSR-VTT | METEOR | 33.4 | VLAB
Video Captioning | MSR-VTT | ROUGE-L | 68.3 | VLAB
Video Captioning | MSVD | BLEU-4 | 79.3 | VLAB
Video Captioning | MSVD | CIDEr | 179.8 | VLAB
Video Captioning | MSVD | METEOR | 51.2 | VLAB
Video Captioning | MSVD | ROUGE-L | 87.9 | VLAB
Video Retrieval | DiDeMo | text-to-video R@1 | 56.8 | VLAB
Video Retrieval | DiDeMo | text-to-video R@5 | 81.6 | VLAB
Video Retrieval | DiDeMo | text-to-video R@10 | 88.7 | VLAB
Video Retrieval | MSR-VTT | text-to-video R@1 | 55.1 | VLAB
Video Retrieval | MSR-VTT | text-to-video R@5 | 78.8 | VLAB
Video Retrieval | MSR-VTT | text-to-video R@10 | 87.6 | VLAB
Video Retrieval | MSVD | text-to-video R@1 | 57.5 | VLAB
Video Retrieval | MSVD | text-to-video R@5 | 83.6 | VLAB
Video Retrieval | MSVD | text-to-video R@10 | 89.9 | VLAB
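
The retrieval rows report text-to-video Recall@K: the fraction of text queries whose ground-truth video appears among the top K retrieved results. As a reference for how such numbers are typically computed from a text-video similarity matrix, here is a small generic sketch; it assumes query i's ground-truth video is index i and is not VLAB's evaluation code.

```python
# Generic text-to-video Recall@K from a (num_texts x num_videos) similarity matrix.
# Assumes text query i is paired with video i; this mirrors common evaluation setups,
# not VLAB's exact evaluation script.
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    ranks = np.argsort(-similarity, axis=1)        # videos sorted by descending similarity
    gt = np.arange(similarity.shape[0])[:, None]   # ground-truth video index per text query
    hits = (ranks[:, :k] == gt).any(axis=1)        # is the ground truth within the top k?
    return float(hits.mean())


sim = np.random.randn(1000, 1000)                  # stand-in for text/video embedding similarities
for k in (1, 5, 10):
    print(f"text-to-video R@{k}: {100 * recall_at_k(sim, k):.1f}")
```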

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)