Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang

Published: 2022-12-30 · ICCV 2023

Tasks: Video Retrieval · Zero-Shot Video Retrieval · Cross-Modal Alignment · Video Question Answering · Video Captioning · Visual Question Answering (VQA) · Zero-Shot Learning · TGIF-Action · TGIF-Transition · TGIF-Frame

Abstract

Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, and thus do not fully exploit the unique characteristic of video: temporality. In this paper, we propose HiTeA, a Hierarchical Temporal-Aware video-language pre-training framework with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which yields detailed video moment representations. In addition, the inherent temporal relations are captured by aligning video-text pairs as a whole at different time resolutions with a multi-modal temporal relation exploration task. Furthermore, we introduce a shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporally-oriented datasets (e.g., SSv2-Template and SSv2-Label), with 8.6% and 11.1% improvements, respectively. HiTeA also demonstrates strong generalization when transferred directly to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
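The shuffling test described in the abstract can be illustrated in code: evaluate text-to-video retrieval accuracy once on the original clips and once on clips whose frames have been randomly permuted, and treat the accuracy drop as a measure of temporal reliance. The sketch below is a minimal stdlib-only interpretation of that idea, not the paper's implementation; `score_fn` is a hypothetical video-text similarity function supplied by the caller.

```python
import random

def shuffling_test(score_fn, videos, texts, seed=0):
    """Hedged sketch of a frame-shuffling test for temporal reliance.

    videos[i] is a list of frames paired with texts[i]; score_fn(video, text)
    returns a similarity score (hypothetical interface, not the paper's API).
    Returns (R@1 on original clips, drop in R@1 after shuffling frames).
    """
    rng = random.Random(seed)

    def r_at_1(clips):
        # Fraction of texts whose paired clip gets the highest score.
        hits = 0
        for i, text in enumerate(texts):
            scores = [score_fn(v, text) for v in clips]
            best = max(range(len(clips)), key=lambda j: scores[j])
            hits += int(best == i)
        return hits / len(texts)

    original = r_at_1(videos)
    # Destroy temporal order by permuting each clip's frames.
    shuffled_clips = [rng.sample(v, len(v)) for v in videos]
    shuffled = r_at_1(shuffled_clips)
    return original, original - shuffled
```

A model (or dataset) that barely loses accuracy under shuffling is not exercising temporal order at all, which is the diagnostic the paper draws from this test.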

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 46.8 | HiTeA |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 71.2 | HiTeA |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 81.9 | HiTeA |
| Video Retrieval | SSv2-template retrieval | text-to-video R@1 | 85.6 | HiTeA |
| Video Retrieval | SSv2-template retrieval | text-to-video R@5 | 100.0 | HiTeA |
| Video Retrieval | SSv2-template retrieval | text-to-video R@10 | 100.0 | HiTeA |
| Video Retrieval | SSv2-label retrieval | text-to-video R@1 | 55.2 | HiTeA |
| Video Retrieval | SSv2-label retrieval | text-to-video R@5 | 81.4 | HiTeA |
| Video Retrieval | SSv2-label retrieval | text-to-video R@10 | 89.1 | HiTeA |
| Video Retrieval | ActivityNet | text-to-video R@1 | 49.7 | HiTeA |
| Video Retrieval | ActivityNet | text-to-video R@5 | 77.1 | HiTeA |
| Video Retrieval | ActivityNet | text-to-video R@10 | 86.7 | HiTeA |
| Video Retrieval | DiDeMo | text-to-video R@1 | 56.5 | HiTeA |
| Video Retrieval | DiDeMo | text-to-video R@5 | 81.7 | HiTeA |
| Video Retrieval | DiDeMo | text-to-video R@10 | 89.7 | HiTeA |
| Video Retrieval | LSMDC | text-to-video R@1 | 28.7 | HiTeA |
| Video Retrieval | LSMDC | text-to-video R@5 | 50.3 | HiTeA |
| Video Retrieval | LSMDC | text-to-video R@10 | 59.0 | HiTeA |
| Zero-Shot Learning | MSRVTT-QA | Accuracy (%) | 21.7 | HiTeA |
| Zero-Shot Learning | MSVD-QA | Accuracy (%) | 37.4 | HiTeA |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy (%) | 45.9 | HiTeA |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy (%) | 55.6 | HiTeA |
| Visual Question Answering (VQA) | TGIF-QA | Accuracy (%) | 73.2 | HiTeA |
| Video Question Answering | NExT-QA | Accuracy (%) | 63.1 | HiTeA |
| Video Question Answering | MSRVTT-MC | Accuracy (%) | 97.4 | HiTeA |
| Video Captioning | MSR-VTT | BLEU-4 | 49.2 | HiTeA |
| Video Captioning | MSR-VTT | CIDEr | 65.1 | HiTeA |
| Video Captioning | MSR-VTT | METEOR | 30.7 | HiTeA |
| Video Captioning | MSR-VTT | ROUGE-L | 65.0 | HiTeA |
| Video Captioning | MSVD | BLEU-4 | 71.0 | HiTeA |
| Video Captioning | MSVD | CIDEr | 146.9 | HiTeA |
| Video Captioning | MSVD | METEOR | 45.3 | HiTeA |
| Video Captioning | MSVD | ROUGE-L | 81.4 | HiTeA |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 34.4 | HiTeA-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 60.0 | HiTeA-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 69.9 | HiTeA-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 29.9 | HiTeA-5M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 54.2 | HiTeA-5M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 62.9 | HiTeA-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 43.2 | HiTeA-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 69.3 | HiTeA-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 79.0 | HiTeA-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 36.1 | HiTeA-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 60.1 | HiTeA-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 70.3 | HiTeA-5M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 18.3 | HiTeA-17M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 36.7 | HiTeA-17M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 44.2 | HiTeA-17M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 15.5 | HiTeA-5M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 31.1 | HiTeA-5M |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 39.8 | HiTeA-5M |
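Most rows above report text-to-video Recall@K: the fraction of text queries whose ground-truth video appears among the K highest-scoring candidates. A minimal stdlib sketch of that computation from a text-video similarity matrix is below; the function name and matrix layout (row i is the query paired with video i) are illustrative conventions, not tied to any particular benchmark toolkit.

```python
def recall_at_k(sim, k):
    """Recall@K as a percentage.

    sim[i][j] is the similarity of text query i to video j, and the
    ground-truth video for query i is video i (diagonal pairing).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Indices of the k videos with the highest similarity to query i.
        topk = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += int(i in topk)
    return 100.0 * hits / len(sim)
```

With this convention, R@K is non-decreasing in K, which is why each retrieval block above satisfies R@1 ≤ R@5 ≤ R@10.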

Related Papers

- Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- CATVis: Context-Aware Thought Visualization (2025-07-15)
- Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)