Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Jie Jiang, Shaobo Min, Weijie Kong, Dihong Gong, Hongfa Wang, Zhifeng Li, Wei Liu

2022-04-07 · Denoising · Video Retrieval · Sentence Embeddings · Contrastive Learning · Retrieval
Paper · PDF

Abstract

Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively cluster correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations for frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet, respectively.
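The hierarchical contrastive learning described in the abstract can be sketched as a weighted sum of symmetric InfoNCE losses, one per granularity (video-sentence, clip-phrase, frame-word). This is an illustrative sketch, not the authors' implementation: the function names, the temperature value, and the equal level weights are all assumptions.

```python
import numpy as np

def info_nce(sim, temperature=0.05):
    """Symmetric InfoNCE over a similarity matrix sim[i, j] = <video_i, text_j>,
    where matched pairs sit on the diagonal."""
    logits = sim / temperature

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))       # cross-entropy of matched pairs

    # average the video->text and text->video directions
    return 0.5 * (xent(logits) + xent(logits.T))

def hierarchical_loss(sims, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of InfoNCE losses at the three granularities.
    `sims` maps each level name to its similarity matrix."""
    levels = ("video_sentence", "clip_phrase", "frame_word")
    return sum(w * info_nce(sims[k]) for w, k in zip(weights, levels))
```

When every level's matched pairs dominate their rows and columns, each per-level loss approaches zero; a flat similarity matrix yields log(batch size) per direction.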

Results

Task: Video Retrieval. R@K = recall at rank K (%, higher is better); MdR / MnR = median / mean rank (lower is better); t2v = text-to-video, v2t = video-to-text.

Dataset     | Model              | Dir | R@1  | R@5  | R@10 | MdR | MnR
MSR-VTT-1kA | HunYuan_tvr (huge) | t2v | 62.9 | 84.5 | 90.8 | 1   | 9.3
MSR-VTT-1kA | HunYuan_tvr (huge) | v2t | 64.8 | 84.9 | 91.1 | 1   | 5.5
MSR-VTT-1kA | HunYuan_tvr        | t2v | 55.0 | –    | –    | –   | –
MSR-VTT-1kA | HunYuan_tvr        | v2t | 55.5 | 78.4 | 85.8 | 1   | 7.7
ActivityNet | HunYuan_tvr        | t2v | 57.3 | 84.8 | 93.1 | 1   | 4.0
ActivityNet | HunYuan_tvr        | v2t | 57.7 | 85.7 | 93.9 | 1   | 3.4
DiDeMo      | HunYuan_tvr (huge) | t2v | 52.7 | 77.8 | 85.2 | 1   | 13.7
DiDeMo      | HunYuan_tvr (huge) | v2t | 54.1 | 78.3 | 86.8 | 1   | 9.1
DiDeMo      | HunYuan_tvr        | t2v | 52.1 | 78.2 | 85.7 | 1   | 11.1
DiDeMo      | HunYuan_tvr        | v2t | 54.8 | 79.9 | 87.2 | 1   | 7.1
LSMDC       | HunYuan_tvr (huge) | t2v | 40.4 | 80.1 | 92.8 | 2   | 3.9
LSMDC       | HunYuan_tvr (huge) | v2t | 34.6 | 71.8 | 91.8 | 2   | 4.3
LSMDC       | HunYuan_tvr        | t2v | 29.7 | 46.4 | 55.4 | 7   | 56.4
LSMDC       | HunYuan_tvr        | v2t | 30.1 | 47.5 | 55.7 | 7   | 48.9
MSVD        | HunYuan_tvr (huge) | t2v | 59.0 | 84.0 | 90.3 | 1   | 7.6
MSVD        | HunYuan_tvr (huge) | v2t | 73.0 | 94.5 | 96.6 | 1   | 7.6
MSVD        | HunYuan_tvr        | t2v | 58.2 | 83.5 | 90.1 | 1   | 7.8
MSVD        | HunYuan_tvr        | v2t | 69.1 | 91.5 | 95.0 | 1   | 3.8
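The R@K, Median Rank, and Mean Rank figures reported above are standard rank-based retrieval metrics; a minimal sketch of how they are computed from a query-by-candidate similarity matrix (assuming, as is conventional, that ground-truth pairs lie on the diagonal):

```python
import numpy as np

def retrieval_metrics(sim):
    """Rank-based metrics from sim[i, j] = similarity of query i to
    candidate j, with ground truth on the diagonal."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # best-scoring candidate first
    # 1-based rank of the ground-truth candidate for each query
    ranks = 1 + np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    return {
        "R@1": 100.0 * np.mean(ranks <= 1),
        "R@5": 100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "MedianRank": float(np.median(ranks)),
        "MeanRank": float(np.mean(ranks)),
    }
```

With a perfect retriever every ground-truth candidate ranks first, so R@1 is 100% and both median and mean rank are 1; Mean Rank is the metric most sensitive to occasional badly ranked queries.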
