Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit a wide range of downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) it is the first high-resolution dataset, including 371.5k hours of 720p videos, and 2) it is the most diversified dataset, covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model with a hybrid Transformer that learns rich spatiotemporal features and a multimodal Transformer that enforces interactions between the learned video features and diversified texts. Our pre-training model achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks. For example, we outperform SOTA models with relative gains of 40.4% R@1 on the zero-shot MSR-VTT text-to-video retrieval task and 55.4% on the high-resolution LSMDC dataset. The learned VL embedding is also effective in generating visually pleasing and semantically relevant results in text-to-visual editing and super-resolution tasks.
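Cross-modal pre-training of this kind is typically driven by an objective that aligns matched video-text pairs while separating mismatched ones. The sketch below illustrates a symmetric InfoNCE-style contrastive loss in plain Python; the function names, the temperature value, and the choice of this particular loss are illustrative assumptions, not the paper's exact training objective.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch: matched (video_i, text_i)
    pairs are pulled together, mismatched pairs pushed apart.
    Illustrative sketch only, not the paper's exact objective."""
    n = len(video_embs)
    # Scaled similarity matrix: sims[i][j] compares video i with text j.
    sims = [[cosine(v, t) / temperature for t in text_embs] for v in video_embs]
    loss = 0.0
    for i in range(n):
        # Video-to-text direction: cross-entropy with text i as the target.
        row = sims[i]
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        # Text-to-video direction: cross-entropy with video i as the target.
        col = [sims[j][i] for j in range(n)]
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)
```

With a batch where matched pairs have identical embeddings, the loss is lower than when the pairing is shuffled, which is the behavior the objective is designed to produce.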
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Retrieval | ActivityNet | text-to-video R@1 | 28.5 | HD-VILA |
| Video Retrieval | ActivityNet | text-to-video R@5 | 57.4 | HD-VILA |
| Video Retrieval | ActivityNet | text-to-video R@50 | 94 | HD-VILA |
| Video Retrieval | ActivityNet | text-to-video Median Rank | 4 | HD-VILA |
| Video Retrieval | DiDeMo | text-to-video R@1 | 28.8 | HD-VILA |
| Video Retrieval | DiDeMo | text-to-video R@5 | 57.4 | HD-VILA |
| Video Retrieval | DiDeMo | text-to-video R@10 | 69.1 | HD-VILA |
| Video Retrieval | DiDeMo | text-to-video Median Rank | 4 | HD-VILA |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 35.6 | HD-VILA |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 65.3 | HD-VILA |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 78 | HD-VILA |
| Video Retrieval | MSR-VTT | text-to-video Median Rank | 3 | HD-VILA |
| Video Retrieval | LSMDC | text-to-video R@1 | 17.4 | HD-VILA |
| Video Retrieval | LSMDC | text-to-video R@5 | 34.1 | HD-VILA |
| Video Retrieval | LSMDC | text-to-video R@10 | 44.1 | HD-VILA |
| Video Retrieval | LSMDC | text-to-video Median Rank | 15 | HD-VILA |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 14.6 | HD-VILA |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 34.4 | HD-VILA |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 44.1 | HD-VILA |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 15 | HD-VILA |
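The R@K and Median Rank numbers above follow the standard text-to-video retrieval protocol: each query caption ranks all candidate videos by similarity, R@K is the percentage of queries whose ground-truth video appears in the top K, and Median Rank is the median position of the ground truth (lower is better). A minimal reference implementation in plain Python (the function name and input layout are my own, not from the paper's codebase):

```python
def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i][j]: similarity of text query i to video j; the ground-truth
    video for query i is assumed to be video i.
    Returns (R@K percentages, median rank); ranks are 1-indexed."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the ground truth: 1 + number of videos scored strictly higher.
        rank = 1 + sum(1 for j, s in enumerate(row) if s > row[i] and j != i)
        ranks.append(rank)
    ranks.sort()
    n = len(ranks)
    recalls = {k: 100.0 * sum(1 for r in ranks if r <= k) / n for k in ks}
    median_rank = (ranks[(n - 1) // 2] + ranks[n // 2]) / 2
    return recalls, median_rank
```

For example, on a 3-by-3 similarity matrix where two of three queries rank their ground-truth video first and the third ranks it second, R@1 is about 66.7, R@5 is 100.0, and the median rank is 1.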