VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

Published: 2021-09-28 (EMNLP 2021)

Tasks: Action Segmentation, Video Retrieval, Action Localization, Zero-Shot Video Retrieval, Long Video Retrieval (Background Removed), Retrieval, Temporal Action Localization, Temporal Relation Extraction


Abstract

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
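The contrastive objective described in the abstract can be illustrated with a small sketch. It assumes a generic symmetric InfoNCE-style loss over a batch of paired embeddings; the abstract does not state the exact loss, so the function names, the `temperature` parameter, and the use of dot-product similarity are all illustrative assumptions, not the authors' implementation. Here every other item in the batch serves as a negative; retrieval-based hard negatives would only change which items end up in the same batch.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of one list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_contrastive_loss(video_emb, text_emb, temperature=1.0):
    """Average of video-to-text and text-to-video InfoNCE losses.

    Assumption: video_emb[i] and text_emb[i] form a positive pair, and
    all other items in the batch act as negatives.
    """
    n = len(video_emb)
    # Similarity matrix: dot product of every video with every text.
    sim = [[sum(a * b for a, b in zip(v, t)) / temperature for t in text_emb]
           for v in video_emb]
    # Video-to-text: each row should concentrate on its diagonal entry.
    v2t = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n
    # Text-to-video: the same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax_row(sim_t[i])[i] for i in range(n)) / n
    return (v2t + t2v) / 2

# Toy batch: aligned pairs should give a lower loss than shuffled pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss falls when each video embedding is closest to its own text embedding, which is the behavior a contrastive objective of this kind rewards.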

Results

Task	Dataset	Metric	Value	Model
Relation Extraction	Vinoground	Group Score	1.2	VideoCLIP
Relation Extraction	Vinoground	Text Score	17	VideoCLIP
Relation Extraction	Vinoground	Video Score	2.8	VideoCLIP
Video	CrossTask	Recall	47.3	VideoCLIP
Video	MSR-VTT-1kA	text-to-video R@1	30.9	VideoCLIP
Video	MSR-VTT-1kA	text-to-video R@10	66.8	VideoCLIP
Video	MSR-VTT-1kA	text-to-video R@5	55.4	VideoCLIP
Video	YouCook2	text-to-video R@1	32.2	VideoCLIP
Video	YouCook2	text-to-video R@10	75	VideoCLIP
Video	YouCook2	text-to-video R@5	62.6	VideoCLIP
Video	YouCook2	text-to-video R@1	22.7	VideoCLIP (zero-shot)
Video	YouCook2	text-to-video R@10	63.1	VideoCLIP (zero-shot)
Video	YouCook2	text-to-video R@5	50.4	VideoCLIP (zero-shot)
Temporal Action Localization	CrossTask	Recall	47.3	VideoCLIP
Zero-Shot Learning	CrossTask	Recall	47.3	VideoCLIP
Action Localization	CrossTask	Recall	47.3	VideoCLIP
Action Localization	COIN	Frame accuracy	68.7	VideoCLIP
Video Retrieval	MSR-VTT-1kA	text-to-video R@1	30.9	VideoCLIP
Video Retrieval	MSR-VTT-1kA	text-to-video R@10	66.8	VideoCLIP
Video Retrieval	MSR-VTT-1kA	text-to-video R@5	55.4	VideoCLIP
Video Retrieval	YouCook2	text-to-video R@1	32.2	VideoCLIP
Video Retrieval	YouCook2	text-to-video R@10	75	VideoCLIP
Video Retrieval	YouCook2	text-to-video R@5	62.6	VideoCLIP
Video Retrieval	YouCook2	text-to-video R@1	22.7	VideoCLIP (zero-shot)
Video Retrieval	YouCook2	text-to-video R@10	63.1	VideoCLIP (zero-shot)
Video Retrieval	YouCook2	text-to-video R@5	50.4	VideoCLIP (zero-shot)
Action Segmentation	COIN	Frame accuracy	68.7	VideoCLIP
Temporal Relation Extraction	Vinoground	Group Score	1.2	VideoCLIP
Temporal Relation Extraction	Vinoground	Text Score	17	VideoCLIP
Temporal Relation Extraction	Vinoground	Video Score	2.8	VideoCLIP
Long Video Retrieval (Background Removed)	YouCook2	Cap. Avg. R@1	74.5	VideoCLIP
Long Video Retrieval (Background Removed)	YouCook2	Cap. Avg. R@10	97.9	VideoCLIP
Long Video Retrieval (Background Removed)	YouCook2	Cap. Avg. R@5	94.5	VideoCLIP
Long Video Retrieval (Background Removed)	YouCook2	DTW R@1	56	VideoCLIP
Long Video Retrieval (Background Removed)	YouCook2	DTW R@10	96.3	VideoCLIP
Long Video Retrieval (Background Removed)	YouCook2	DTW R@5	89.9	VideoCLIP
Long Video Retrieval (Background Removed)	YouCook2	OTAM R@1	52.8	VideoCLIP
Long Video Retrieval (Background Removed)	YouCook2	OTAM R@10	95	VideoCLIP
Long Video Retrieval (Background Removed)	YouCook2	OTAM R@5	89.2	VideoCLIP
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	10.4	VideoCLIP
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	30	VideoCLIP
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@5	22.2	VideoCLIP
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@1	16.6	VideoCLIP
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@5	46.9	VideoCLIP
Zero-Shot Video Retrieval	YouCook2	text-to-video R@1	22.7	VideoCLIP
Zero-Shot Video Retrieval	YouCook2	text-to-video R@10	63.1	VideoCLIP
Zero-Shot Video Retrieval	YouCook2	text-to-video R@5	50.4	VideoCLIP
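
The text-to-video R@K values in the table are recall-at-K scores. As a minimal sketch of how such a metric is commonly computed, the function below assumes a square similarity matrix in which the correct video for text query i is video i; the function name and the toy scores are hypothetical and are not taken from the paper's evaluation code.

```python
def recall_at_k(sim, k):
    """Percentage of text queries whose matching video ranks in the top k.

    sim[i][j] is the similarity of text query i to video j.
    Assumption: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of videos scoring strictly higher than the correct one.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)

# Toy example: 3 text queries scored against 3 videos.
toy_sim = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.8, 0.3, 0.5],  # query 1: correct video ranked third
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(toy_sim, 1)
r2 = recall_at_k(toy_sim, 2)
r3 = recall_at_k(toy_sim, 3)
```

By construction, R@K cannot decrease as K grows, since any query counted as a hit at a smaller K is also a hit at a larger K.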
