Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, Xirong Li

2021-12-03 | Video Retrieval, feature selection, Ad-hoc video search, Text to Video Retrieval, Retrieval

Abstract

In this paper, we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research, which considers feature fusion at only one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
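To make the idea of a convex combination of features concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The `tanh` projection into a shared space, the single scoring vector, and all names (`project`, `convex_fusion`, `score_vector`) are assumptions chosen for illustration; only the convex combination itself (non-negative weights that sum to one) comes from the abstract.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(feature, matrix):
    """Map one feature into the shared space (assumed: linear layer + tanh)."""
    return [math.tanh(sum(w * x for w, x in zip(row, feature))) for row in matrix]


def convex_fusion(features, matrices, score_vector):
    """Fuse features of different sizes into a single vector.

    Each feature is projected to a common dimension, scored with one shared
    vector (hypothetical scoring scheme), and the softmax of the scores gives
    the convex-combination weights.
    """
    projected = [project(f, m) for f, m in zip(features, matrices)]
    scores = [sum(s * p for s, p in zip(score_vector, vec)) for vec in projected]
    alphas = softmax(scores)
    dim = len(score_vector)
    fused = [sum(a * vec[j] for a, vec in zip(alphas, projected)) for j in range(dim)]
    return fused, alphas


# Toy data: three "off-the-shelf" features of different sizes, random parameters.
random.seed(0)
D = 4
feature_dims = [6, 3, 5]
features = [[random.gauss(0, 1) for _ in range(n)] for n in feature_dims]
matrices = [[[random.gauss(0, 0.5) for _ in range(n)] for _ in range(D)]
            for n in feature_dims]
score_vector = [random.gauss(0, 1) for _ in range(D)]

fused, alphas = convex_fusion(features, matrices, score_vector)
print("weights:", [round(a, 3) for a in alphas])
print("fused:  ", [round(v, 3) for v in fused])
```

Because the weights are non-negative and sum to one, each weight can be read as the relative contribution of one feature, which is the kind of interpretability the abstract suggests using for feature selection. The scoring step here costs one dot product per feature, in contrast to the pairwise interactions that multi-head self-attention computes.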

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video; Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 45.8 | LAFF |
| Video; Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 71.5 | LAFF |
| Video; Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 82 | LAFF |
| Video; Video Retrieval | VATEX | text-to-video R@1 | 59.1 | LAFF |
| Video; Video Retrieval | VATEX | text-to-video R@10 | 91.7 | LAFF |
| Video; Video Retrieval | VATEX | text-to-video R@50 | 96.3 | LAFF |
| Video; Video Retrieval | MSR-VTT | text-to-video R@1 | 29.1 | LAFF |
| Video; Video Retrieval | MSR-VTT | text-to-video R@5 | 54.9 | LAFF |
| Video; Video Retrieval | MSR-VTT | text-to-video R@10 | 65.8 | LAFF |
| Video; Video Retrieval | TGIF | text-to-video R@1 | 24.5 | LAFF |
| Video; Video Retrieval | TGIF | text-to-video R@5 | 45 | LAFF |
| Video; Video Retrieval | TGIF | text-to-video R@10 | 54.5 | LAFF |
| Video; Video Retrieval | MSVD | text-to-video R@1 | 45.4 | LAFF |
| Video; Video Retrieval | MSVD | text-to-video R@5 | 76 | LAFF |
| Video; Video Retrieval | MSVD | text-to-video R@10 | 84.6 | LAFF |
| Ad-hoc video search | TRECVID-AVS16 (IACC.3) | infAP | 0.222 | LAFF |
| Ad-hoc video search | TRECVID-AVS17 (IACC.3) | infAP | 0.29 | LAFF |
| Ad-hoc video search | TRECVID-AVS18 (IACC.3) | infAP | 0.147 | LAFF |
| Ad-hoc video search | TRECVID-AVS19 (V3C1) | infAP | 0.192 | LAFF |
| Ad-hoc video search | TRECVID-AVS20 (V3C1) | infAP | 0.265 | LAFF |
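
The R@K figures in the results are recall at rank K: the percentage of text queries for which the correct video appears among the top K retrieved videos. The sketch below shows one common way to compute it. It assumes exactly one relevant video per query and a precomputed query-by-video similarity matrix; the function name `recall_at_k` and the toy scores are hypothetical and are not taken from the paper or its code.

```python
def recall_at_k(similarity, ground_truth, ks=(1, 5, 10)):
    """Compute recall at several cutoffs, in percent.

    similarity[q][v] is the score of video v for text query q.
    ground_truth[q] is the index of the single relevant video for query q
    (assumption: one relevant video per query).
    """
    ranks = []
    for q, row in enumerate(similarity):
        target = row[ground_truth[q]]
        # Rank of the relevant video: 1 + number of videos scored strictly higher.
        ranks.append(1 + sum(1 for s in row if s > target))
    n = len(ranks)
    return {k: 100.0 * sum(r <= k for r in ranks) / n for k in ks}


# Toy example: 4 queries, 5 candidate videos, query i matches video i.
similarity = [
    [0.9, 0.1, 0.3, 0.2, 0.0],  # relevant video ranked 1st
    [0.8, 0.5, 0.6, 0.1, 0.2],  # relevant video ranked 3rd
    [0.2, 0.3, 0.7, 0.1, 0.0],  # relevant video ranked 1st
    [0.9, 0.8, 0.7, 0.1, 0.6],  # relevant video ranked 5th
]
ground_truth = [0, 1, 2, 3]

result = recall_at_k(similarity, ground_truth, ks=(1, 3, 5))
print(result)  # {1: 50.0, 3: 75.0, 5: 100.0}
```

Ties are resolved in favor of the relevant video here, which is a simplification; an evaluation script may break ties differently.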
