Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, Xirong Li
In this paper, we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Different from previous research that considers feature fusion only at one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
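
The convex-combination idea stated in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the input features have already been projected to a common dimension, and the names `softmax`, `fuse` and `attn_vector` (a stand-in for learned attention parameters) are hypothetical. Each feature gets a scalar score, the scores are turned into non-negative weights that sum to one, and the fused feature is the weighted sum. The weights themselves are what would make such a fusion interpretable.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, attn_vector):
    """Fuse k same-dimension features by a convex combination.

    features    : list of k vectors, each of length d (assumed already
                  projected to a common space)
    attn_vector : hypothetical learned vector of length d used to score
                  each feature with a dot product
    Returns (fused_vector, weights).
    """
    scores = [sum(a * x for a, x in zip(attn_vector, f)) for f in features]
    weights = softmax(scores)
    d = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(d)]
    return fused, weights


# Two toy 2-d features; a zero attention vector scores them equally.
fused, weights = fuse([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```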
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MSR-VTT-1kA | text-to-video R@1 | 45.8 | LAFF |
| Video | MSR-VTT-1kA | text-to-video R@5 | 71.5 | LAFF |
| Video | MSR-VTT-1kA | text-to-video R@10 | 82 | LAFF |
| Video | VATEX | text-to-video R@1 | 59.1 | LAFF |
| Video | VATEX | text-to-video R@10 | 91.7 | LAFF |
| Video | VATEX | text-to-video R@50 | 96.3 | LAFF |
| Video | MSR-VTT | text-to-video R@1 | 29.1 | LAFF |
| Video | MSR-VTT | text-to-video R@5 | 54.9 | LAFF |
| Video | MSR-VTT | text-to-video R@10 | 65.8 | LAFF |
| Video | TGIF | text-to-video R@1 | 24.5 | LAFF |
| Video | TGIF | text-to-video R@5 | 45 | LAFF |
| Video | TGIF | text-to-video R@10 | 54.5 | LAFF |
| Video | MSVD | text-to-video R@1 | 45.4 | LAFF |
| Video | MSVD | text-to-video R@5 | 76 | LAFF |
| Video | MSVD | text-to-video R@10 | 84.6 | LAFF |
| Ad-hoc video search | TRECVID-AVS20 (V3C1) | infAP | 0.265 | LAFF |
| Ad-hoc video search | TRECVID-AVS17 (IACC.3) | infAP | 0.29 | LAFF |
| Ad-hoc video search | TRECVID-AVS18 (IACC.3) | infAP | 0.147 | LAFF |
| Ad-hoc video search | TRECVID-AVS16 (IACC.3) | infAP | 0.222 | LAFF |
| Ad-hoc video search | TRECVID-AVS19 (V3C1) | infAP | 0.192 | LAFF |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 45.8 | LAFF |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 71.5 | LAFF |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 82 | LAFF |
| Video Retrieval | VATEX | text-to-video R@1 | 59.1 | LAFF |
| Video Retrieval | VATEX | text-to-video R@10 | 91.7 | LAFF |
| Video Retrieval | VATEX | text-to-video R@50 | 96.3 | LAFF |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 29.1 | LAFF |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 54.9 | LAFF |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 65.8 | LAFF |
| Video Retrieval | TGIF | text-to-video R@1 | 24.5 | LAFF |
| Video Retrieval | TGIF | text-to-video R@5 | 45 | LAFF |
| Video Retrieval | TGIF | text-to-video R@10 | 54.5 | LAFF |
| Video Retrieval | MSVD | text-to-video R@1 | 45.4 | LAFF |
| Video Retrieval | MSVD | text-to-video R@5 | 76 | LAFF |
| Video Retrieval | MSVD | text-to-video R@10 | 84.6 | LAFF |
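
For readers unfamiliar with the R@K metric in the table, the following is a small sketch of how recall at rank K is commonly computed for text-to-video retrieval. It assumes exactly one relevant video per text query, which may not match the evaluation protocol of every benchmark listed; the function name `recall_at_k` and the toy video ids are illustrative only.

```python
def recall_at_k(ranked_lists, relevant, k):
    """Percentage of queries whose relevant video appears in the top k.

    ranked_lists : one list of video ids per query, sorted by
                   descending similarity to that query
    relevant     : the ground-truth video id for each query
                   (assumption: a single relevant video per query)
    """
    hits = sum(
        1 for ranked, gt in zip(ranked_lists, relevant) if gt in ranked[:k]
    )
    return 100.0 * hits / len(relevant)


# Two toy queries: the first ranks its target first, the second ranks it last.
rankings = [["v1", "v2", "v3"], ["v3", "v1", "v2"]]
targets = ["v1", "v2"]
r_at_1 = recall_at_k(rankings, targets, 1)
r_at_3 = recall_at_k(rankings, targets, 3)
```

In this toy case, only the first query is a hit at K = 1, while both are hits at K = 3, so the metric can only stay level or grow as K increases.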