Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang
Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms, and thus do not fully exploit the defining characteristic of video, namely its temporal dimension. In this paper, we propose HiTeA, a Hierarchical Temporal-Aware video-language pre-training framework with two novel pre-training tasks that model the cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task that explores moments in videos and yields detailed video moment representations. In addition, a multi-modal temporal relation exploration task captures the inherent temporal relations by aligning video-text pairs as a whole at different time resolutions. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), where we improve by 8.6% and 11.1%, respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
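The abstract names the shuffling test but does not spell out its procedure. The sketch below illustrates one plausible reading, assuming the test compares a model's score on the original frame order with its score after the frames of each video are randomly permuted. The function names (`shuffle_frames`, `shuffling_test`) and the two toy scoring functions are hypothetical and are not taken from the HiTeA code.

```python
import random
from typing import Callable, List, Sequence, Tuple

Video = List[int]  # stand-in for an ordered list of frames (assumption)


def shuffle_frames(frames: Sequence[int], rng: random.Random) -> Video:
    """Return a randomly permuted copy of the frames; the input is not modified."""
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def shuffling_test(
    score_fn: Callable[[List[Video]], float],
    videos: List[Video],
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Score the videos with original and shuffled frame order.

    Returns (original score, shuffled score, gap). Under the assumption
    stated above, a larger gap suggests stronger reliance on temporal order.
    """
    rng = random.Random(seed)
    original = score_fn(videos)
    shuffled = score_fn([shuffle_frames(v, rng) for v in videos])
    return original, shuffled, original - shuffled


# Toy scoring functions standing in for a real model plus metric (hypothetical).
def order_sensitive_score(videos: List[Video]) -> float:
    """Fraction of videos whose frames appear in ascending order."""
    return sum(v == sorted(v) for v in videos) / len(videos)


def order_insensitive_score(videos: List[Video]) -> float:
    """Fraction of videos with no repeated frame; ignores frame order."""
    return sum(len(set(v)) == len(v) for v in videos) / len(videos)


if __name__ == "__main__":
    toy_videos = [list(range(8)) for _ in range(20)]
    print(shuffling_test(order_sensitive_score, toy_videos))
    print(shuffling_test(order_insensitive_score, toy_videos))
```

In this toy setup the order-insensitive score is unchanged by shuffling, while the order-sensitive score can only stay the same or drop, which is the kind of contrast such a test is meant to expose.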
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Zero-Shot Learning | MSRVTT-QA | Accuracy | 21.7 | HiTeA |
| Zero-Shot Learning | MSVD-QA | Accuracy | 37.4 | HiTeA |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 45.9 | HiTeA |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 55.6 | HiTeA |
| Visual Question Answering (VQA) | TGIF-QA | Accuracy | 73.2 | HiTeA |
| Video Question Answering | NExT-QA | Accuracy | 63.1 | HiTeA |
| Video Question Answering | MSRVTT-MC | Accuracy | 97.4 | HiTeA |
| Video Captioning | MSR-VTT | BLEU-4 | 49.2 | HiTeA |
| Video Captioning | MSR-VTT | CIDEr | 65.1 | HiTeA |
| Video Captioning | MSR-VTT | METEOR | 30.7 | HiTeA |
| Video Captioning | MSR-VTT | ROUGE-L | 65 | HiTeA |
| Video Captioning | MSVD | BLEU-4 | 71 | HiTeA |
| Video Captioning | MSVD | CIDEr | 146.9 | HiTeA |
| Video Captioning | MSVD | METEOR | 45.3 | HiTeA |
| Video Captioning | MSVD | ROUGE-L | 81.4 | HiTeA |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 46.8 | HiTeA |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 71.2 | HiTeA |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 81.9 | HiTeA |
| Video Retrieval | SSv2-template retrieval | text-to-video R@1 | 85.6 | HiTeA |
| Video Retrieval | SSv2-template retrieval | text-to-video R@5 | 100 | HiTeA |
| Video Retrieval | SSv2-template retrieval | text-to-video R@10 | 100 | HiTeA |
| Video Retrieval | ActivityNet | text-to-video R@1 | 49.7 | HiTeA |
| Video Retrieval | ActivityNet | text-to-video R@5 | 77.1 | HiTeA |
| Video Retrieval | ActivityNet | text-to-video R@10 | 86.7 | HiTeA |
| Video Retrieval | SSv2-label retrieval | text-to-video R@1 | 55.2 | HiTeA |
| Video Retrieval | SSv2-label retrieval | text-to-video R@5 | 81.4 | HiTeA |
| Video Retrieval | SSv2-label retrieval | text-to-video R@10 | 89.1 | HiTeA |
| Video Retrieval | DiDeMo | text-to-video R@1 | 56.5 | HiTeA |
| Video Retrieval | DiDeMo | text-to-video R@5 | 81.7 | HiTeA |
| Video Retrieval | DiDeMo | text-to-video R@10 | 89.7 | HiTeA |
| Video Retrieval | LSMDC | text-to-video R@1 | 28.7 | HiTeA |
| Video Retrieval | LSMDC | text-to-video R@5 | 50.3 | HiTeA |
| Video Retrieval | LSMDC | text-to-video R@10 | 59 | HiTeA |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 34.4 | HiTeA-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 60 | HiTeA-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 69.9 | HiTeA-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 29.9 | HiTeA-5M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 54.2 | HiTeA-5M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 62.9 | HiTeA-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 43.2 | HiTeA-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 69.3 | HiTeA-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 79 | HiTeA-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 36.1 | HiTeA-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 60.1 | HiTeA-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 70.3 | HiTeA-5M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 18.3 | HiTeA-17M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 36.7 | HiTeA-17M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 44.2 | HiTeA-17M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 15.5 | HiTeA-5M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 31.1 | HiTeA-5M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 39.8 | HiTeA-5M |
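
The retrieval rows in the table report text-to-video Recall@K (R@K). As a reading aid, here is a minimal sketch of how this metric is commonly computed from a text-by-video similarity matrix. It assumes that text `i` is paired with video `i` and that ties are resolved in favor of the correct video. The function name `recall_at_k` and the toy similarity values are illustrative and do not come from the paper.

```python
from typing import Dict, Sequence


def recall_at_k(
    sim: Sequence[Sequence[float]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Text-to-video Recall@K in percent.

    sim[i][j] is the similarity between text i and video j; the correct
    video for text i is assumed to sit at index i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Zero-based rank: how many videos score strictly higher than the correct one.
        ranks.append(sum(1 for score in row if score > row[i]))
    n = len(ranks)
    return {k: 100.0 * sum(r < k for r in ranks) / n for k in ks}


if __name__ == "__main__":
    # Toy 3x3 example: the correct video is ranked 1st, 2nd, and 3rd respectively.
    toy_sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.5, 0.3],
        [0.7, 0.6, 0.1],
    ]
    print(recall_at_k(toy_sim, ks=(1, 2, 3)))
```

Because any query whose correct video falls in the top 5 also has it in the top 10, R@1 ≤ R@5 ≤ R@10 holds for every dataset under this definition.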