Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C. H. Hoi

2021-12-17, CVPR 2022
Tasks: Video Retrieval, Zero-Shot Video Retrieval, Cross-Modal Alignment, Entity Alignment, Retrieval, Visual Question Answering (VQA)


Abstract

Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment. First, we introduce a video-text contrastive (VTC) loss to align unimodal video-text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a new visually-grounded pre-training task, prompting entity modeling (PEM), which aims to learn fine-grained region-entity alignment. To achieve this, we first introduce an entity prompter module, which is trained with VTC to produce the similarity between a video crop and text prompts instantiated with entity names. The PEM task then asks the model to predict the entity pseudo-labels (i.e., normalized similarity scores) for randomly selected video crops. The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and videoQA, outperforming prior work by a substantial margin. Our code and pre-trained models are available at https://github.com/salesforce/ALPRO.
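
The two objectives named in the abstract can be illustrated with a small, dependency-free sketch. It is not the official ALPRO implementation. The function names, the temperature of 0.07, the averaging of the two contrastive directions, and the use of a softmax to normalize similarity scores are all assumptions made for illustration. Embeddings are plain Python lists standing in for the outputs of the video and text encoders.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (a zero vector is left unchanged).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def log_softmax(row):
    # Numerically stable log-softmax over a list of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def cosine(a, b):
    # Cosine similarity between two vectors.
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def vtc_loss(video_embs, text_embs, temperature=0.07):
    # Instance-level contrastive loss: video_embs[i] and text_embs[i] form
    # the positive pair; every other pairing in the batch is a negative.
    # Assumption: the two directions are averaged with equal weight.
    n = len(video_embs)
    sim = [[cosine(v, t) / temperature for t in text_embs] for v in video_embs]
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]  # transpose for text-to-video
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)

def entity_pseudo_labels(crop_emb, prompt_embs, temperature=0.07):
    # Hypothetical entity prompter output: similarity between one video-crop
    # embedding and each entity-prompt embedding, normalized with a softmax
    # (assumed form of the "normalized similarity scores").
    sims = [cosine(crop_emb, p) / temperature for p in prompt_embs]
    return [math.exp(lp) for lp in log_softmax(sims)]

def pem_loss(pred_logits, pseudo_labels):
    # Cross-entropy between the model's predicted entity distribution and
    # the soft pseudo-labels produced by the prompter.
    log_probs = log_softmax(pred_logits)
    return -sum(q * lp for q, lp in zip(pseudo_labels, log_probs))
```

In this sketch, a batch whose video and text embeddings are correctly paired gives a lower `vtc_loss` than the same batch with the texts shuffled, and `pem_loss` is smallest when the predicted logits favor the entity that the prompter scores highest.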

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	MSRVTT-QA	Accuracy	0.421	ALPRO
Visual Question Answering (VQA)	MSVD-QA	Accuracy	0.459	ALPRO
Video Retrieval	DiDeMo	text-to-video R@1	35.9	ALPRO
Video Retrieval	DiDeMo	text-to-video R@5	67.5	ALPRO
Video Retrieval	DiDeMo	text-to-video R@10	78.8	ALPRO
Video Retrieval	DiDeMo	text-to-video Median Rank	3	ALPRO
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	24.1	ALPRO
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@5	44.7	ALPRO
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	55.4	ALPRO
Zero-Shot Video Retrieval	MSR-VTT	text-to-video Median Rank	8	ALPRO
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@1	23.8	ALPRO
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@5	47.3	ALPRO
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@10	57.9	ALPRO
Zero-Shot Video Retrieval	DiDeMo	text-to-video Median Rank	6	ALPRO
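
For readers unfamiliar with the retrieval metrics in the table, the following sketch shows how text-to-video R@K and Median Rank are commonly computed from a query-by-video similarity matrix. It is an illustration, not the evaluation script used for the numbers above. It assumes one ground-truth video per text query (video i for query i) and resolves ties in favor of the correct video; the function names are hypothetical.

```python
from statistics import median

def retrieval_ranks(sim):
    # sim[i][j] is the score of video j for text query i.
    # The rank of the correct video is 1 plus the number of videos
    # that score strictly higher than it.
    return [1 + sum(1 for s in row if s > row[i]) for i, row in enumerate(sim)]

def recall_at_k(ranks, k):
    # Percentage of queries whose correct video appears in the top k.
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank(ranks):
    # Median position of the correct video across all queries (lower is better).
    return median(ranks)

# Toy example with three text queries and three videos.
sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.2, 0.3],
]
ranks = retrieval_ranks(sim)
```

Higher R@K and lower Median Rank both indicate better retrieval. In the toy example the second query ranks its correct video behind one distractor, so R@1 falls below 100 while R@2 reaches it.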
