Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Jie Jiang, Shaobo Min, Weijie Kong, Dihong Gong, Hongfa Wang, Zhifeng Li, Wei Liu

2022-04-07
Denoising, Video Retrieval, Sentence Embeddings, Contrastive Learning, Retrieval

Abstract

Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively cluster correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations at frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
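To make the idea of hierarchical contrastive learning more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each level (frame-word, clip-phrase, video-sentence) has already been pooled to one embedding per sample, that a symmetric InfoNCE loss is used at each level, and that the per-level losses are combined by a weighted sum. The function names, the temperature value, and the weighting scheme are all hypothetical; the abstract specifies none of them.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    # Guard against a zero vector so the division is always defined.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(video_embs, text_embs, temperature=0.05):
    """Symmetric InfoNCE loss; video_embs[i] is assumed to pair with text_embs[i]."""
    v = [l2_normalize(x) for x in video_embs]
    t = [l2_normalize(x) for x in text_embs]
    n = len(v)
    # Cosine similarity scaled by the (assumed) temperature.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the logit of the true pair.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    video_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_video = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (video_to_text + text_to_video)


def hierarchical_loss(levels, weights=None):
    """Weighted sum of per-level losses.

    levels maps a level name to (video_embs, text_embs); the names and the
    equal default weights are illustrative assumptions.
    """
    weights = weights or {name: 1.0 for name in levels}
    return sum(weights[name] * info_nce(v, t) for name, (v, t) in levels.items())


# Toy batch of two samples with 2-d embeddings, reused at all three levels.
video = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
levels = {
    "frame-word": (video, text),
    "clip-phrase": (video, text),
    "video-sentence": (video, text),
}
total = hierarchical_loss(levels)
```

In this toy batch the paired embeddings are identical, so the loss is close to zero; swapping the two text embeddings would make every positive pair the least similar one and drive the loss up sharply.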

Results

Task	Dataset	Metric	Value	Model
Video	MSR-VTT-1kA	text-to-video Mean Rank	9.3	HunYuan_tvr (huge)
Video	MSR-VTT-1kA	text-to-video Median Rank	1	HunYuan_tvr (huge)
Video	MSR-VTT-1kA	text-to-video R@1	62.9	HunYuan_tvr (huge)
Video	MSR-VTT-1kA	text-to-video R@10	90.8	HunYuan_tvr (huge)
Video	MSR-VTT-1kA	text-to-video R@5	84.5	HunYuan_tvr (huge)
Video	MSR-VTT-1kA	video-to-text Mean Rank	5.5	HunYuan_tvr (huge)
Video	MSR-VTT-1kA	video-to-text Median Rank	1	HunYuan_tvr (huge)
Video	MSR-VTT-1kA	video-to-text R@1	64.8	HunYuan_tvr (huge)
Video	MSR-VTT-1kA	video-to-text R@10	91.1	HunYuan_tvr (huge)
Video	MSR-VTT-1kA	video-to-text R@5	84.9	HunYuan_tvr (huge)
Video	MSR-VTT-1kA	text-to-video R@1	55	HunYuan_tvr
Video	MSR-VTT-1kA	video-to-text Mean Rank	7.7	HunYuan_tvr
Video	MSR-VTT-1kA	video-to-text Median Rank	1	HunYuan_tvr
Video	MSR-VTT-1kA	video-to-text R@1	55.5	HunYuan_tvr
Video	MSR-VTT-1kA	video-to-text R@10	85.8	HunYuan_tvr
Video	MSR-VTT-1kA	video-to-text R@5	78.4	HunYuan_tvr
Video	ActivityNet	text-to-video Mean Rank	4	HunYuan_tvr
Video	ActivityNet	text-to-video Median Rank	1	HunYuan_tvr
Video	ActivityNet	text-to-video R@1	57.3	HunYuan_tvr
Video	ActivityNet	text-to-video R@10	93.1	HunYuan_tvr
Video	ActivityNet	text-to-video R@5	84.8	HunYuan_tvr
Video	ActivityNet	video-to-text Mean Rank	3.4	HunYuan_tvr
Video	ActivityNet	video-to-text Median Rank	1	HunYuan_tvr
Video	ActivityNet	video-to-text R@1	57.7	HunYuan_tvr
Video	ActivityNet	video-to-text R@10	93.9	HunYuan_tvr
Video	ActivityNet	video-to-text R@5	85.7	HunYuan_tvr
Video	DiDeMo	text-to-video Mean Rank	13.7	HunYuan_tvr (huge)
Video	DiDeMo	text-to-video Median Rank	1	HunYuan_tvr (huge)
Video	DiDeMo	text-to-video R@1	52.7	HunYuan_tvr (huge)
Video	DiDeMo	text-to-video R@10	85.2	HunYuan_tvr (huge)
Video	DiDeMo	text-to-video R@5	77.8	HunYuan_tvr (huge)
Video	DiDeMo	video-to-text Mean Rank	9.1	HunYuan_tvr (huge)
Video	DiDeMo	video-to-text Median Rank	1	HunYuan_tvr (huge)
Video	DiDeMo	video-to-text R@1	54.1	HunYuan_tvr (huge)
Video	DiDeMo	video-to-text R@10	86.8	HunYuan_tvr (huge)
Video	DiDeMo	video-to-text R@5	78.3	HunYuan_tvr (huge)
Video	DiDeMo	text-to-video Mean Rank	11.1	HunYuan_tvr
Video	DiDeMo	text-to-video Median Rank	1	HunYuan_tvr
Video	DiDeMo	text-to-video R@1	52.1	HunYuan_tvr
Video	DiDeMo	text-to-video R@10	85.7	HunYuan_tvr
Video	DiDeMo	text-to-video R@5	78.2	HunYuan_tvr
Video	DiDeMo	video-to-text Mean Rank	7.1	HunYuan_tvr
Video	DiDeMo	video-to-text Median Rank	1	HunYuan_tvr
Video	DiDeMo	video-to-text R@1	54.8	HunYuan_tvr
Video	DiDeMo	video-to-text R@10	87.2	HunYuan_tvr
Video	DiDeMo	video-to-text R@5	79.9	HunYuan_tvr
Video	LSMDC	text-to-video Mean Rank	3.9	HunYuan_tvr (huge)
Video	LSMDC	text-to-video Median Rank	2	HunYuan_tvr (huge)
Video	LSMDC	text-to-video R@1	40.4	HunYuan_tvr (huge)
Video	LSMDC	text-to-video R@10	92.8	HunYuan_tvr (huge)
Video	LSMDC	text-to-video R@5	80.1	HunYuan_tvr (huge)
Video	LSMDC	video-to-text Mean Rank	4.3	HunYuan_tvr (huge)
Video	LSMDC	video-to-text Median Rank	2	HunYuan_tvr (huge)
Video	LSMDC	video-to-text R@1	34.6	HunYuan_tvr (huge)
Video	LSMDC	video-to-text R@10	91.8	HunYuan_tvr (huge)
Video	LSMDC	video-to-text R@5	71.8	HunYuan_tvr (huge)
Video	LSMDC	text-to-video Mean Rank	56.4	HunYuan_tvr
Video	LSMDC	text-to-video Median Rank	7	HunYuan_tvr
Video	LSMDC	text-to-video R@1	29.7	HunYuan_tvr
Video	LSMDC	text-to-video R@10	55.4	HunYuan_tvr
Video	LSMDC	text-to-video R@5	46.4	HunYuan_tvr
Video	LSMDC	video-to-text Mean Rank	48.9	HunYuan_tvr
Video	LSMDC	video-to-text Median Rank	7	HunYuan_tvr
Video	LSMDC	video-to-text R@1	30.1	HunYuan_tvr
Video	LSMDC	video-to-text R@10	55.7	HunYuan_tvr
Video	LSMDC	video-to-text R@5	47.5	HunYuan_tvr
Video	MSVD	text-to-video Mean Rank	7.6	HunYuan_tvr (huge)
Video	MSVD	text-to-video Median Rank	1	HunYuan_tvr (huge)
Video	MSVD	text-to-video R@1	59	HunYuan_tvr (huge)
Video	MSVD	text-to-video R@10	90.3	HunYuan_tvr (huge)
Video	MSVD	text-to-video R@5	84	HunYuan_tvr (huge)
Video	MSVD	video-to-text Mean Rank	7.6	HunYuan_tvr (huge)
Video	MSVD	video-to-text Median Rank	1	HunYuan_tvr (huge)
Video	MSVD	video-to-text R@1	73	HunYuan_tvr (huge)
Video	MSVD	video-to-text R@10	96.6	HunYuan_tvr (huge)
Video	MSVD	video-to-text R@5	94.5	HunYuan_tvr (huge)
Video	MSVD	text-to-video Mean Rank	7.8	HunYuan_tvr
Video	MSVD	text-to-video Median Rank	1	HunYuan_tvr
Video	MSVD	text-to-video R@1	58.2	HunYuan_tvr
Video	MSVD	text-to-video R@10	90.1	HunYuan_tvr
Video	MSVD	text-to-video R@5	83.5	HunYuan_tvr
Video	MSVD	video-to-text Mean Rank	3.8	HunYuan_tvr
Video	MSVD	video-to-text Median Rank	1	HunYuan_tvr
Video	MSVD	video-to-text R@1	69.1	HunYuan_tvr
Video	MSVD	video-to-text R@10	95	HunYuan_tvr
Video	MSVD	video-to-text R@5	91.5	HunYuan_tvr
Video Retrieval	MSR-VTT-1kA	text-to-video Mean Rank	9.3	HunYuan_tvr (huge)
Video Retrieval	MSR-VTT-1kA	text-to-video Median Rank	1	HunYuan_tvr (huge)
Video Retrieval	MSR-VTT-1kA	text-to-video R@1	62.9	HunYuan_tvr (huge)
Video Retrieval	MSR-VTT-1kA	text-to-video R@10	90.8	HunYuan_tvr (huge)
Video Retrieval	MSR-VTT-1kA	text-to-video R@5	84.5	HunYuan_tvr (huge)
Video Retrieval	MSR-VTT-1kA	video-to-text Mean Rank	5.5	HunYuan_tvr (huge)
Video Retrieval	MSR-VTT-1kA	video-to-text Median Rank	1	HunYuan_tvr (huge)
Video Retrieval	MSR-VTT-1kA	video-to-text R@1	64.8	HunYuan_tvr (huge)
Video Retrieval	MSR-VTT-1kA	video-to-text R@10	91.1	HunYuan_tvr (huge)
Video Retrieval	MSR-VTT-1kA	video-to-text R@5	84.9	HunYuan_tvr (huge)
Video Retrieval	MSR-VTT-1kA	text-to-video R@1	55	HunYuan_tvr
Video Retrieval	MSR-VTT-1kA	video-to-text Mean Rank	7.7	HunYuan_tvr
Video Retrieval	MSR-VTT-1kA	video-to-text Median Rank	1	HunYuan_tvr
Video Retrieval	MSR-VTT-1kA	video-to-text R@1	55.5	HunYuan_tvr
Video Retrieval	MSR-VTT-1kA	video-to-text R@10	85.8	HunYuan_tvr
Video Retrieval	MSR-VTT-1kA	video-to-text R@5	78.4	HunYuan_tvr
Video Retrieval	ActivityNet	text-to-video Mean Rank	4	HunYuan_tvr
Video Retrieval	ActivityNet	text-to-video Median Rank	1	HunYuan_tvr
Video Retrieval	ActivityNet	text-to-video R@1	57.3	HunYuan_tvr
Video Retrieval	ActivityNet	text-to-video R@10	93.1	HunYuan_tvr
Video Retrieval	ActivityNet	text-to-video R@5	84.8	HunYuan_tvr
Video Retrieval	ActivityNet	video-to-text Mean Rank	3.4	HunYuan_tvr
Video Retrieval	ActivityNet	video-to-text Median Rank	1	HunYuan_tvr
Video Retrieval	ActivityNet	video-to-text R@1	57.7	HunYuan_tvr
Video Retrieval	ActivityNet	video-to-text R@10	93.9	HunYuan_tvr
Video Retrieval	ActivityNet	video-to-text R@5	85.7	HunYuan_tvr
Video Retrieval	DiDeMo	text-to-video Mean Rank	13.7	HunYuan_tvr (huge)
Video Retrieval	DiDeMo	text-to-video Median Rank	1	HunYuan_tvr (huge)
Video Retrieval	DiDeMo	text-to-video R@1	52.7	HunYuan_tvr (huge)
Video Retrieval	DiDeMo	text-to-video R@10	85.2	HunYuan_tvr (huge)
Video Retrieval	DiDeMo	text-to-video R@5	77.8	HunYuan_tvr (huge)
Video Retrieval	DiDeMo	video-to-text Mean Rank	9.1	HunYuan_tvr (huge)
Video Retrieval	DiDeMo	video-to-text Median Rank	1	HunYuan_tvr (huge)
Video Retrieval	DiDeMo	video-to-text R@1	54.1	HunYuan_tvr (huge)
Video Retrieval	DiDeMo	video-to-text R@10	86.8	HunYuan_tvr (huge)
Video Retrieval	DiDeMo	video-to-text R@5	78.3	HunYuan_tvr (huge)
Video Retrieval	DiDeMo	text-to-video Mean Rank	11.1	HunYuan_tvr
Video Retrieval	DiDeMo	text-to-video Median Rank	1	HunYuan_tvr
Video Retrieval	DiDeMo	text-to-video R@1	52.1	HunYuan_tvr
Video Retrieval	DiDeMo	text-to-video R@10	85.7	HunYuan_tvr
Video Retrieval	DiDeMo	text-to-video R@5	78.2	HunYuan_tvr
Video Retrieval	DiDeMo	video-to-text Mean Rank	7.1	HunYuan_tvr
Video Retrieval	DiDeMo	video-to-text Median Rank	1	HunYuan_tvr
Video Retrieval	DiDeMo	video-to-text R@1	54.8	HunYuan_tvr
Video Retrieval	DiDeMo	video-to-text R@10	87.2	HunYuan_tvr
Video Retrieval	DiDeMo	video-to-text R@5	79.9	HunYuan_tvr
Video Retrieval	LSMDC	text-to-video Mean Rank	3.9	HunYuan_tvr (huge)
Video Retrieval	LSMDC	text-to-video Median Rank	2	HunYuan_tvr (huge)
Video Retrieval	LSMDC	text-to-video R@1	40.4	HunYuan_tvr (huge)
Video Retrieval	LSMDC	text-to-video R@10	92.8	HunYuan_tvr (huge)
Video Retrieval	LSMDC	text-to-video R@5	80.1	HunYuan_tvr (huge)
Video Retrieval	LSMDC	video-to-text Mean Rank	4.3	HunYuan_tvr (huge)
Video Retrieval	LSMDC	video-to-text Median Rank	2	HunYuan_tvr (huge)
Video Retrieval	LSMDC	video-to-text R@1	34.6	HunYuan_tvr (huge)
Video Retrieval	LSMDC	video-to-text R@10	91.8	HunYuan_tvr (huge)
Video Retrieval	LSMDC	video-to-text R@5	71.8	HunYuan_tvr (huge)
Video Retrieval	LSMDC	text-to-video Mean Rank	56.4	HunYuan_tvr
Video Retrieval	LSMDC	text-to-video Median Rank	7	HunYuan_tvr
Video Retrieval	LSMDC	text-to-video R@1	29.7	HunYuan_tvr
Video Retrieval	LSMDC	text-to-video R@10	55.4	HunYuan_tvr
Video Retrieval	LSMDC	text-to-video R@5	46.4	HunYuan_tvr
Video Retrieval	LSMDC	video-to-text Mean Rank	48.9	HunYuan_tvr
Video Retrieval	LSMDC	video-to-text Median Rank	7	HunYuan_tvr
Video Retrieval	LSMDC	video-to-text R@1	30.1	HunYuan_tvr
Video Retrieval	LSMDC	video-to-text R@10	55.7	HunYuan_tvr
Video Retrieval	LSMDC	video-to-text R@5	47.5	HunYuan_tvr
Video Retrieval	MSVD	text-to-video Mean Rank	7.6	HunYuan_tvr (huge)
Video Retrieval	MSVD	text-to-video Median Rank	1	HunYuan_tvr (huge)
Video Retrieval	MSVD	text-to-video R@1	59	HunYuan_tvr (huge)
Video Retrieval	MSVD	text-to-video R@10	90.3	HunYuan_tvr (huge)
Video Retrieval	MSVD	text-to-video R@5	84	HunYuan_tvr (huge)
Video Retrieval	MSVD	video-to-text Mean Rank	7.6	HunYuan_tvr (huge)
Video Retrieval	MSVD	video-to-text Median Rank	1	HunYuan_tvr (huge)
Video Retrieval	MSVD	video-to-text R@1	73	HunYuan_tvr (huge)
Video Retrieval	MSVD	video-to-text R@10	96.6	HunYuan_tvr (huge)
Video Retrieval	MSVD	video-to-text R@5	94.5	HunYuan_tvr (huge)
Video Retrieval	MSVD	text-to-video Mean Rank	7.8	HunYuan_tvr
Video Retrieval	MSVD	text-to-video Median Rank	1	HunYuan_tvr
Video Retrieval	MSVD	text-to-video R@1	58.2	HunYuan_tvr
Video Retrieval	MSVD	text-to-video R@10	90.1	HunYuan_tvr
Video Retrieval	MSVD	text-to-video R@5	83.5	HunYuan_tvr
Video Retrieval	MSVD	video-to-text Mean Rank	3.8	HunYuan_tvr
Video Retrieval	MSVD	video-to-text Median Rank	1	HunYuan_tvr
Video Retrieval	MSVD	video-to-text R@1	69.1	HunYuan_tvr
Video Retrieval	MSVD	video-to-text R@10	95	HunYuan_tvr
Video Retrieval	MSVD	video-to-text R@5	91.5	HunYuan_tvr
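For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute R@K, Median Rank, and Mean Rank from a query-by-candidate similarity matrix. It is an illustration under stated assumptions, not the evaluation code behind the reported numbers: it assumes exactly one correct candidate per query (the one at the same index) and resolves ties in the query's favour. Benchmarks with several captions per video need a different ground-truth mapping. The function names are hypothetical.

```python
import statistics


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct candidate for each query.

    sim[i][j] is the score of query i against candidate j, and candidate i
    is assumed to be the single correct match for query i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank is 1 plus the number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def retrieval_metrics(sim, ks=(1, 5, 10)):
    ranks = ranks_from_similarity(sim)
    n = len(ranks)
    # R@K is the percentage of queries whose correct match is in the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = statistics.median(ranks)
    metrics["Mean Rank"] = sum(ranks) / n
    return metrics


# Toy text-to-video example: rows are text queries, columns are videos.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
text_to_video = retrieval_metrics(sim)

# Video-to-text uses the transposed matrix.
video_to_text = retrieval_metrics([list(col) for col in zip(*sim)])
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank.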
