HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang

2022-12-30, ICCV 2023

Tags: Video Retrieval, Zero-Shot Video Retrieval, cross-modal alignment, Video Question Answering, Video Captioning, TGIF-Transition, Visual Question Answering (VQA), TGIF-Action, Zero-Shot Learning, TGIF-Frame

Abstract

Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, and thus do not fully exploit the unique characteristic of video, i.e., its temporal nature. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representations. In addition, the inherent temporal relations are captured by aligning video-text pairs as a whole at different time resolutions with a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
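The abstract names the shuffling test without giving its procedure. The sketch below shows one plausible reading: score each video-text pair once with its frames in their original order and once with the frames randomly permuted, then report the drop in the mean score. The names `shuffling_gap`, `shuffle_frames`, and `score_fn`, along with the toy scorers, are hypothetical and are not taken from the paper or any released code.

```python
import random
from typing import Callable, List, Sequence, Tuple

Frames = List[int]                 # stand-in for a list of decoded frames
Sample = Tuple[Frames, str]        # (frames, paired text)


def shuffle_frames(frames: Frames, rng: random.Random) -> Frames:
    """Return a randomly permuted copy of the frame sequence."""
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def shuffling_gap(
    score_fn: Callable[[Frames, str], float],
    dataset: Sequence[Sample],
    seed: int = 0,
) -> float:
    """Mean score on ordered frames minus mean score on shuffled frames.

    Assumption: a larger gap suggests the model/dataset pair relies more
    on temporal order; a gap near zero suggests order is barely used.
    """
    rng = random.Random(seed)
    n = len(dataset)
    original = sum(score_fn(frames, text) for frames, text in dataset) / n
    shuffled = sum(
        score_fn(shuffle_frames(frames, rng), text) for frames, text in dataset
    ) / n
    return original - shuffled


# Toy scorers (hypothetical): one ignores frame order, one depends on it.
def order_insensitive(frames: Frames, text: str) -> float:
    return float(len(frames) == 8)


def order_sensitive(frames: Frames, text: str) -> float:
    return float(frames == sorted(frames))


toy_dataset: List[Sample] = [(list(range(8)), "caption") for _ in range(4)]
gap_insensitive = shuffling_gap(order_insensitive, toy_dataset)
gap_sensitive = shuffling_gap(order_sensitive, toy_dataset)
```

In this toy setup the order-insensitive scorer yields a gap of exactly zero, while the order-sensitive scorer can only lose score when frames are permuted. In practice `score_fn` would be replaced by a real downstream metric such as retrieval recall or QA accuracy.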

Results

Task	Dataset	Metric	Value	Model
Zero-Shot Learning	MSRVTT-QA	Accuracy	21.7	HiTeA
Zero-Shot Learning	MSVD-QA	Accuracy	37.4	HiTeA
Visual Question Answering (VQA)	MSRVTT-QA	Accuracy	45.9	HiTeA
Visual Question Answering (VQA)	MSVD-QA	Accuracy	55.6	HiTeA
Visual Question Answering (VQA)	TGIF-QA	Accuracy	73.2	HiTeA
Video Question Answering	NExT-QA	Accuracy	63.1	HiTeA
Video Question Answering	MSRVTT-MC	Accuracy	97.4	HiTeA
Video Captioning	MSR-VTT	BLEU-4	49.2	HiTeA
Video Captioning	MSR-VTT	CIDEr	65.1	HiTeA
Video Captioning	MSR-VTT	METEOR	30.7	HiTeA
Video Captioning	MSR-VTT	ROUGE-L	65	HiTeA
Video Captioning	MSVD	BLEU-4	71	HiTeA
Video Captioning	MSVD	CIDEr	146.9	HiTeA
Video Captioning	MSVD	METEOR	45.3	HiTeA
Video Captioning	MSVD	ROUGE-L	81.4	HiTeA
Video Retrieval	MSR-VTT-1kA	text-to-video R@1	46.8	HiTeA
Video Retrieval	MSR-VTT-1kA	text-to-video R@5	71.2	HiTeA
Video Retrieval	MSR-VTT-1kA	text-to-video R@10	81.9	HiTeA
Video Retrieval	SSv2-template retrieval	text-to-video R@1	85.6	HiTeA
Video Retrieval	SSv2-template retrieval	text-to-video R@5	100	HiTeA
Video Retrieval	SSv2-template retrieval	text-to-video R@10	100	HiTeA
Video Retrieval	ActivityNet	text-to-video R@1	49.7	HiTeA
Video Retrieval	ActivityNet	text-to-video R@5	77.1	HiTeA
Video Retrieval	ActivityNet	text-to-video R@10	86.7	HiTeA
Video Retrieval	SSv2-label retrieval	text-to-video R@1	55.2	HiTeA
Video Retrieval	SSv2-label retrieval	text-to-video R@5	81.4	HiTeA
Video Retrieval	SSv2-label retrieval	text-to-video R@10	89.1	HiTeA
Video Retrieval	DiDeMo	text-to-video R@1	56.5	HiTeA
Video Retrieval	DiDeMo	text-to-video R@5	81.7	HiTeA
Video Retrieval	DiDeMo	text-to-video R@10	89.7	HiTeA
Video Retrieval	LSMDC	text-to-video R@1	28.7	HiTeA
Video Retrieval	LSMDC	text-to-video R@5	50.3	HiTeA
Video Retrieval	LSMDC	text-to-video R@10	59	HiTeA
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	34.4	HiTeA-17M
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@5	60	HiTeA-17M
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	69.9	HiTeA-17M
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	29.9	HiTeA-5M
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@5	54.2	HiTeA-5M
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	62.9	HiTeA-5M
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@1	43.2	HiTeA-17M
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@5	69.3	HiTeA-17M
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@10	79	HiTeA-17M
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@1	36.1	HiTeA-5M
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@5	60.1	HiTeA-5M
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@10	70.3	HiTeA-5M
Zero-Shot Video Retrieval	LSMDC	text-to-video R@1	18.3	HiTeA-17M
Zero-Shot Video Retrieval	LSMDC	text-to-video R@5	36.7	HiTeA-17M
Zero-Shot Video Retrieval	LSMDC	text-to-video R@10	44.2	HiTeA-17M
Zero-Shot Video Retrieval	LSMDC	text-to-video R@1	15.5	HiTeA-5M
Zero-Shot Video Retrieval	LSMDC	text-to-video R@5	31.1	HiTeA-5M
Zero-Shot Video Retrieval	LSMDC	text-to-video R@10	39.8	HiTeA-5M
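
For readers unfamiliar with the retrieval metric in the table, text-to-video R@K is the percentage of text queries whose correct video appears among the top K ranked results. The sketch below is a minimal illustration, assuming a square similarity matrix in which query i is paired with video i; the function name `recall_at_k` and the toy scores are hypothetical and do not come from the HiTeA evaluation code.

```python
from typing import Dict, Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Text-to-video Recall@K in percent.

    similarity[i][j] is the score of text query i against video j.
    Assumption: the ground-truth video for query i is video i.
    """
    n = len(similarity)
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # 0-based rank: how many videos score strictly higher than the match.
        ranks.append(sum(1 for score in row if score > target))
    # A query counts as a hit at K when its match ranks within the top K.
    return {k: 100.0 * sum(rank < k for rank in ranks) / n for k in ks}


# Toy example: 3 text queries against 3 videos.
toy_similarity = [
    [0.9, 0.1, 0.0],   # correct video ranked 1st
    [0.8, 0.5, 0.1],   # correct video ranked 2nd
    [0.2, 0.3, 0.1],   # correct video ranked 3rd
]
toy_recall = recall_at_k(toy_similarity, ks=(1, 2, 3))
```

Because every hit at K is also a hit at any larger K, R@K can never decrease as K grows, which is a quick sanity check for any row triple in the table.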
