Jie Jiang, Shaobo Min, Weijie Kong, Dihong Gong, Hongfa Wang, Zhifeng Li, Wei Liu
Text-video retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering the intrinsic semantic relations among frames, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations at frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
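
The hierarchical contrastive objective described in the abstract can be illustrated with a minimal sketch. It assumes that each level (frame-word, clip-phrase, video-sentence) has already been pooled to one embedding per sample, and that a symmetric InfoNCE loss is applied per level and summed. The function names, the temperature value, and the uniform level weights are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(video_embs, text_embs, temperature=0.05):
    """Symmetric InfoNCE over a batch; video i and text i form the positive pair.

    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    n = len(video_embs)
    sim = [[cosine(v, t) / temperature for t in text_embs] for v in video_embs]

    def diagonal_nll(rows):
        # Mean negative log-softmax of the matching (diagonal) entry of each row.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += row[i] - log_sum_exp
        return -total / n

    video_to_text = diagonal_nll(sim)
    text_to_video = diagonal_nll([list(col) for col in zip(*sim)])
    return 0.5 * (video_to_text + text_to_video)


def hierarchical_loss(levels, weights=None):
    """Weighted sum of per-level losses.

    levels maps a level name to (video_embs, text_embs); uniform weights are
    an assumption made for this sketch.
    """
    weights = weights or {name: 1.0 for name in levels}
    return sum(weights[name] * info_nce(v, t) for name, (v, t) in levels.items())
```

With this sketch, a batch whose video and text embeddings are aligned at every level yields a loss close to zero, while mismatched pairs are penalized at whichever level they disagree.
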
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 9.3 | HunYuan_tvr (huge) |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 1 | HunYuan_tvr (huge) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 62.9 | HunYuan_tvr (huge) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 90.8 | HunYuan_tvr (huge) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 84.5 | HunYuan_tvr (huge) |
| Video Retrieval | MSR-VTT-1kA | video-to-text Mean Rank | 5.5 | HunYuan_tvr (huge) |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 1 | HunYuan_tvr (huge) |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 64.8 | HunYuan_tvr (huge) |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 91.1 | HunYuan_tvr (huge) |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 84.9 | HunYuan_tvr (huge) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 55 | HunYuan_tvr |
| Video Retrieval | MSR-VTT-1kA | video-to-text Mean Rank | 7.7 | HunYuan_tvr |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 1 | HunYuan_tvr |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 55.5 | HunYuan_tvr |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 85.8 | HunYuan_tvr |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 78.4 | HunYuan_tvr |
| Video Retrieval | ActivityNet | text-to-video Mean Rank | 4 | HunYuan_tvr |
| Video Retrieval | ActivityNet | text-to-video Median Rank | 1 | HunYuan_tvr |
| Video Retrieval | ActivityNet | text-to-video R@1 | 57.3 | HunYuan_tvr |
| Video Retrieval | ActivityNet | text-to-video R@10 | 93.1 | HunYuan_tvr |
| Video Retrieval | ActivityNet | text-to-video R@5 | 84.8 | HunYuan_tvr |
| Video Retrieval | ActivityNet | video-to-text Mean Rank | 3.4 | HunYuan_tvr |
| Video Retrieval | ActivityNet | video-to-text Median Rank | 1 | HunYuan_tvr |
| Video Retrieval | ActivityNet | video-to-text R@1 | 57.7 | HunYuan_tvr |
| Video Retrieval | ActivityNet | video-to-text R@10 | 93.9 | HunYuan_tvr |
| Video Retrieval | ActivityNet | video-to-text R@5 | 85.7 | HunYuan_tvr |
| Video Retrieval | DiDeMo | text-to-video Mean Rank | 13.7 | HunYuan_tvr (huge) |
| Video Retrieval | DiDeMo | text-to-video Median Rank | 1 | HunYuan_tvr (huge) |
| Video Retrieval | DiDeMo | text-to-video R@1 | 52.7 | HunYuan_tvr (huge) |
| Video Retrieval | DiDeMo | text-to-video R@10 | 85.2 | HunYuan_tvr (huge) |
| Video Retrieval | DiDeMo | text-to-video R@5 | 77.8 | HunYuan_tvr (huge) |
| Video Retrieval | DiDeMo | video-to-text Mean Rank | 9.1 | HunYuan_tvr (huge) |
| Video Retrieval | DiDeMo | video-to-text Median Rank | 1 | HunYuan_tvr (huge) |
| Video Retrieval | DiDeMo | video-to-text R@1 | 54.1 | HunYuan_tvr (huge) |
| Video Retrieval | DiDeMo | video-to-text R@10 | 86.8 | HunYuan_tvr (huge) |
| Video Retrieval | DiDeMo | video-to-text R@5 | 78.3 | HunYuan_tvr (huge) |
| Video Retrieval | DiDeMo | text-to-video Mean Rank | 11.1 | HunYuan_tvr |
| Video Retrieval | DiDeMo | text-to-video Median Rank | 1 | HunYuan_tvr |
| Video Retrieval | DiDeMo | text-to-video R@1 | 52.1 | HunYuan_tvr |
| Video Retrieval | DiDeMo | text-to-video R@10 | 85.7 | HunYuan_tvr |
| Video Retrieval | DiDeMo | text-to-video R@5 | 78.2 | HunYuan_tvr |
| Video Retrieval | DiDeMo | video-to-text Mean Rank | 7.1 | HunYuan_tvr |
| Video Retrieval | DiDeMo | video-to-text Median Rank | 1 | HunYuan_tvr |
| Video Retrieval | DiDeMo | video-to-text R@1 | 54.8 | HunYuan_tvr |
| Video Retrieval | DiDeMo | video-to-text R@10 | 87.2 | HunYuan_tvr |
| Video Retrieval | DiDeMo | video-to-text R@5 | 79.9 | HunYuan_tvr |
| Video Retrieval | LSMDC | text-to-video Mean Rank | 3.9 | HunYuan_tvr (huge) |
| Video Retrieval | LSMDC | text-to-video Median Rank | 2 | HunYuan_tvr (huge) |
| Video Retrieval | LSMDC | text-to-video R@1 | 40.4 | HunYuan_tvr (huge) |
| Video Retrieval | LSMDC | text-to-video R@10 | 92.8 | HunYuan_tvr (huge) |
| Video Retrieval | LSMDC | text-to-video R@5 | 80.1 | HunYuan_tvr (huge) |
| Video Retrieval | LSMDC | video-to-text Mean Rank | 4.3 | HunYuan_tvr (huge) |
| Video Retrieval | LSMDC | video-to-text Median Rank | 2 | HunYuan_tvr (huge) |
| Video Retrieval | LSMDC | video-to-text R@1 | 34.6 | HunYuan_tvr (huge) |
| Video Retrieval | LSMDC | video-to-text R@10 | 91.8 | HunYuan_tvr (huge) |
| Video Retrieval | LSMDC | video-to-text R@5 | 71.8 | HunYuan_tvr (huge) |
| Video Retrieval | LSMDC | text-to-video Mean Rank | 56.4 | HunYuan_tvr |
| Video Retrieval | LSMDC | text-to-video Median Rank | 7 | HunYuan_tvr |
| Video Retrieval | LSMDC | text-to-video R@1 | 29.7 | HunYuan_tvr |
| Video Retrieval | LSMDC | text-to-video R@10 | 55.4 | HunYuan_tvr |
| Video Retrieval | LSMDC | text-to-video R@5 | 46.4 | HunYuan_tvr |
| Video Retrieval | LSMDC | video-to-text Mean Rank | 48.9 | HunYuan_tvr |
| Video Retrieval | LSMDC | video-to-text Median Rank | 7 | HunYuan_tvr |
| Video Retrieval | LSMDC | video-to-text R@1 | 30.1 | HunYuan_tvr |
| Video Retrieval | LSMDC | video-to-text R@10 | 55.7 | HunYuan_tvr |
| Video Retrieval | LSMDC | video-to-text R@5 | 47.5 | HunYuan_tvr |
| Video Retrieval | MSVD | text-to-video Mean Rank | 7.6 | HunYuan_tvr (huge) |
| Video Retrieval | MSVD | text-to-video Median Rank | 1 | HunYuan_tvr (huge) |
| Video Retrieval | MSVD | text-to-video R@1 | 59 | HunYuan_tvr (huge) |
| Video Retrieval | MSVD | text-to-video R@10 | 90.3 | HunYuan_tvr (huge) |
| Video Retrieval | MSVD | text-to-video R@5 | 84 | HunYuan_tvr (huge) |
| Video Retrieval | MSVD | video-to-text Mean Rank | 7.6 | HunYuan_tvr (huge) |
| Video Retrieval | MSVD | video-to-text Median Rank | 1 | HunYuan_tvr (huge) |
| Video Retrieval | MSVD | video-to-text R@1 | 73 | HunYuan_tvr (huge) |
| Video Retrieval | MSVD | video-to-text R@10 | 96.6 | HunYuan_tvr (huge) |
| Video Retrieval | MSVD | video-to-text R@5 | 94.5 | HunYuan_tvr (huge) |
| Video Retrieval | MSVD | text-to-video Mean Rank | 7.8 | HunYuan_tvr |
| Video Retrieval | MSVD | text-to-video Median Rank | 1 | HunYuan_tvr |
| Video Retrieval | MSVD | text-to-video R@1 | 58.2 | HunYuan_tvr |
| Video Retrieval | MSVD | text-to-video R@10 | 90.1 | HunYuan_tvr |
| Video Retrieval | MSVD | text-to-video R@5 | 83.5 | HunYuan_tvr |
| Video Retrieval | MSVD | video-to-text Mean Rank | 3.8 | HunYuan_tvr |
| Video Retrieval | MSVD | video-to-text Median Rank | 1 | HunYuan_tvr |
| Video Retrieval | MSVD | video-to-text R@1 | 69.1 | HunYuan_tvr |
| Video Retrieval | MSVD | video-to-text R@10 | 95 | HunYuan_tvr |
| Video Retrieval | MSVD | video-to-text R@5 | 91.5 | HunYuan_tvr |
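
The metrics in the table (R@1, R@5, R@10, Median Rank, Mean Rank) are standard retrieval measures. The sketch below shows one common way to compute them from a query-by-candidate similarity matrix. It assumes a single ground-truth candidate per query located on the diagonal and breaks ties optimistically; the function name `retrieval_metrics` is hypothetical, and the paper's exact evaluation protocol (for example, for datasets with several captions per video) may differ.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute recall@k (in percent), median rank, and mean rank.

    sim[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # 1-based rank: candidates scored strictly higher than the match, plus one.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics
```

For text-to-video retrieval the rows of `sim` would be text queries and the columns videos; transposing the matrix gives the video-to-text direction. Higher is better for R@k, and lower is better for the two rank metrics.
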