Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Disentangled Representation Learning for Text-Video Retrieval

Qiang Wang, Yanhao Zhang, Yun Zheng, Pan Pan, Xian-Sheng Hua

2022-03-14 · Video Retrieval · Representation Learning · Retrieval

Paper · PDF · Code · Code (official)

Abstract

Cross-modality interaction is a critical component in Text-Video Retrieval (TVR), yet there has been little examination of how different influencing factors for computing interaction affect performance. This paper first studies the interaction paradigm in depth, where we find that its computation can be split into two terms: the interaction contents at different granularity and the matching function to distinguish pairs with the same semantics. We also observe that the single-vector representation and implicit intensive function substantially hinder the optimization. Based on these findings, we propose a disentangled framework to capture a sequential and hierarchical representation. Firstly, considering the natural sequential structure in both text and video inputs, a Weighted Token-wise Interaction (WTI) module is performed to decouple the content and adaptively exploit the pair-wise correlations. This interaction can form a better disentangled manifold for sequential inputs. Secondly, we introduce a Channel DeCorrelation Regularization (CDCR) to minimize the redundancy between the components of the compared vectors, which facilitates learning a hierarchical representation. We demonstrate the effectiveness of the disentangled representation on various benchmarks, e.g., surpassing CLIP4Clip by +2.9%, +3.1%, +7.9%, +2.3%, +2.8% and +6.5% R@1 on MSR-VTT, MSVD, VATEX, LSMDC, ActivityNet, and DiDeMo, respectively.
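The two components described above can be sketched concretely. Below is a minimal NumPy illustration of the general idea, not the authors' released implementation: a weighted token-wise interaction that scores a text-video pair by aggregating each token's best-matching frame similarity under learned weights, and a channel-decorrelation penalty on the off-diagonal entries of the cross-correlation matrix. The helper names `wti_score` and `cdcr_loss`, the symmetric max-aggregation, and the exact normalization are assumptions for illustration.

```python
import numpy as np

def wti_score(text_tokens, video_frames, text_w, video_w):
    """Weighted Token-wise Interaction (illustrative sketch).

    text_tokens : (Nt, D) L2-normalized token embeddings
    video_frames: (Nv, D) L2-normalized frame embeddings
    text_w      : (Nt,) non-negative token weights summing to 1
    video_w     : (Nv,) non-negative frame weights summing to 1
    """
    sim = text_tokens @ video_frames.T            # (Nt, Nv) cosine similarities
    t2v = (text_w * sim.max(axis=1)).sum()        # each token picks its best frame
    v2t = (video_w * sim.max(axis=0)).sum()       # each frame picks its best token
    return 0.5 * (t2v + v2t)                      # symmetric pair score

def cdcr_loss(x, y):
    """Channel DeCorrelation Regularization (illustrative sketch).

    x, y: (B, D) batches of paired text/video vectors; the penalty pushes
    off-diagonal cross-correlations between channels toward zero.
    """
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    y = (y - y.mean(axis=0)) / (y.std(axis=0) + 1e-8)
    c = (x.T @ y) / x.shape[0]                    # (D, D) cross-correlation
    off_diag = c - np.diag(np.diag(c))            # zero out the diagonal
    return (off_diag ** 2).sum()
```

In this reading, WTI replaces a single-vector dot product with a weighted sum of per-token maxima, while CDCR acts only between channels, leaving the diagonal (matched-channel) correlations free.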

Results

Task             Dataset      Metric                     Value  Model
Video Retrieval  MSR-VTT-1kA  text-to-video Mean Rank    11.4   DRL
Video Retrieval  MSR-VTT-1kA  text-to-video Median Rank  1      DRL
Video Retrieval  MSR-VTT-1kA  text-to-video R@1          53.3   DRL
Video Retrieval  MSR-VTT-1kA  text-to-video R@5          80.3   DRL
Video Retrieval  MSR-VTT-1kA  text-to-video R@10         87.6   DRL
Video Retrieval  MSR-VTT-1kA  video-to-text Mean Rank    7.6    DRL
Video Retrieval  MSR-VTT-1kA  video-to-text Median Rank  1      DRL
Video Retrieval  MSR-VTT-1kA  video-to-text R@1          56.2   DRL
Video Retrieval  MSR-VTT-1kA  video-to-text R@5          79.9   DRL
Video Retrieval  MSR-VTT-1kA  video-to-text R@10         87.4   DRL
Video Retrieval  DiDeMo       text-to-video Mean Rank    11.5   DRL
Video Retrieval  DiDeMo       text-to-video Median Rank  2      DRL
Video Retrieval  DiDeMo       text-to-video R@1          49.0   DRL
Video Retrieval  DiDeMo       text-to-video R@5          76.5   DRL
Video Retrieval  DiDeMo       text-to-video R@10         84.5   DRL
Video Retrieval  DiDeMo       video-to-text Mean Rank    7.9    DRL
Video Retrieval  DiDeMo       video-to-text Median Rank  2      DRL
Video Retrieval  DiDeMo       video-to-text R@1          49.9   DRL
Video Retrieval  DiDeMo       video-to-text R@10         83.3   DRL
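The metrics above are standard retrieval statistics derived from a query-by-candidate similarity matrix. As a point of reference, here is a minimal sketch of how they are typically computed; `retrieval_metrics` is a hypothetical helper, not code from the paper, and it assumes the ground-truth match for query i is candidate i (the diagonal).

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@K, median rank, and mean rank from a similarity matrix.

    sim: (N, N) array where sim[i, j] scores query i against candidate j,
         with the ground-truth match on the diagonal.
    """
    order = np.argsort(-sim, axis=1)                            # descending similarity
    # Position of the correct candidate in each query's ranking (1 = best).
    ranks = np.where(order == np.arange(len(sim))[:, None])[1] + 1
    return {
        "R@1":   np.mean(ranks <= 1) * 100,
        "R@5":   np.mean(ranks <= 5) * 100,
        "R@10":  np.mean(ranks <= 10) * 100,
        "MedR":  float(np.median(ranks)),
        "MeanR": float(ranks.mean()),
    }
```

Higher R@K is better (the correct item appears in the top K), while lower median and mean rank are better, which is why DRL's MedR of 1 on MSR-VTT-1kA pairs with its 53.3 R@1.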

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)