Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, Xirong Li

2021-12-03

Video Retrieval, feature selection, Ad-hoc video search, Text to Video Retrieval, Retrieval

Abstract

In this paper, we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research that considers feature fusion at only one end, be it video or text, we aim to fuse features at both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferable to modeling their correlations by computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
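
To make the idea of a convex combination of features concrete, the following is a minimal sketch, not the authors' implementation. It assumes that all features have already been projected to a common dimension, and it uses a hypothetical `scoring_vector` (standing in for learned parameters) to score each feature; a softmax turns the scores into non-negative weights that sum to 1. The names `softmax`, `convex_fusion` and `scoring_vector` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax: outputs are non-negative and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def convex_fusion(features, scoring_vector):
    """Fuse k feature vectors of equal length d into one vector.

    features: list of k vectors, assumed already projected to a common
        d-dimensional space (an assumption of this sketch).
    scoring_vector: hypothetical d-dimensional vector that assigns one
        scalar score to each feature via a dot product.
    Returns the fused vector and the per-feature weights.
    """
    scores = [sum(w * x for w, x in zip(scoring_vector, f)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    # Weighted sum with convex weights: the result stays inside the
    # convex hull of the input features.
    fused = [sum(a * f[i] for a, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


# Three toy features in a shared 2-dimensional space.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = convex_fusion(features, [0.5, -0.5])
```

Because each feature receives a single scalar weight, the weights can be read directly as a rough indication of how much each feature contributes, which is one plausible reading of how such a scheme could support feature selection.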

Results

Task	Dataset	Metric	Value	Model
Ad-hoc video search	TRECVID-AVS16 (IACC.3)	infAP	0.222	LAFF
Ad-hoc video search	TRECVID-AVS17 (IACC.3)	infAP	0.29	LAFF
Ad-hoc video search	TRECVID-AVS18 (IACC.3)	infAP	0.147	LAFF
Ad-hoc video search	TRECVID-AVS19 (V3C1)	infAP	0.192	LAFF
Ad-hoc video search	TRECVID-AVS20 (V3C1)	infAP	0.265	LAFF
Video Retrieval	MSR-VTT-1kA	text-to-video R@1	45.8	LAFF
Video Retrieval	MSR-VTT-1kA	text-to-video R@5	71.5	LAFF
Video Retrieval	MSR-VTT-1kA	text-to-video R@10	82	LAFF
Video Retrieval	VATEX	text-to-video R@1	59.1	LAFF
Video Retrieval	VATEX	text-to-video R@10	91.7	LAFF
Video Retrieval	VATEX	text-to-video R@50	96.3	LAFF
Video Retrieval	MSR-VTT	text-to-video R@1	29.1	LAFF
Video Retrieval	MSR-VTT	text-to-video R@5	54.9	LAFF
Video Retrieval	MSR-VTT	text-to-video R@10	65.8	LAFF
Video Retrieval	TGIF	text-to-video R@1	24.5	LAFF
Video Retrieval	TGIF	text-to-video R@5	45	LAFF
Video Retrieval	TGIF	text-to-video R@10	54.5	LAFF
Video Retrieval	MSVD	text-to-video R@1	45.4	LAFF
Video Retrieval	MSVD	text-to-video R@5	76	LAFF
Video Retrieval	MSVD	text-to-video R@10	84.6	LAFF
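
For readers unfamiliar with the R@K metric in the table, here is a small sketch of how text-to-video Recall@K is commonly computed: the percentage of text queries whose relevant video appears among the top K ranked results. The sketch assumes exactly one relevant video per query and a precomputed similarity matrix; the function name `recall_at_k` and the toy data are illustrative and do not come from the paper's evaluation code.

```python
def recall_at_k(similarity, ground_truth, k):
    """Percentage of queries whose relevant video ranks within the top k.

    similarity[i][j]: score of video j for text query i (higher is better).
    ground_truth[i]: index of the single relevant video for query i
        (one relevant video per query is an assumption of this sketch).
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Zero-based rank of the target: how many videos score strictly higher.
        rank = sum(1 for s in scores if s > scores[target])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(ground_truth)


# Toy example: 3 text queries scored against 3 videos.
similarity = [
    [0.9, 0.1, 0.3],  # query 0: relevant video 0 is ranked first
    [0.2, 0.4, 0.8],  # query 1: relevant video 1 is ranked second
    [0.5, 0.6, 0.7],  # query 2: relevant video 2 is ranked first
]
ground_truth = [0, 1, 2]
r1 = recall_at_k(similarity, ground_truth, 1)
r2 = recall_at_k(similarity, ground_truth, 2)
```

In this toy case two of the three queries place their relevant video first, so R@1 is about 66.7, and all three place it within the top two, so R@2 is 100.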
