Video-Text Retrieval by Supervised Sparse Multi-Grained Learning

Yimu Wang, Peng Shi

2023-02-19
Tags: Video Retrieval, Representation Learning, Video-Text Retrieval, Text Retrieval, Retrieval, Sparse Learning

Abstract

While recent progress in video-text retrieval has been advanced by the exploration of better representation learning, in this paper, we present a novel multi-grained sparse learning framework, S3MA, to learn an aligned sparse space shared between the video and the text for video-text retrieval. The shared sparse space is initialized with a finite number of sparse concepts, each of which refers to a number of words. With the text data at hand, we learn and update the shared sparse space in a supervised manner using the proposed similarity and alignment losses. Moreover, to enable multi-grained alignment, we incorporate frame representations for better modeling the video modality and calculating fine-grained and coarse-grained similarities. Benefiting from the learned shared sparse space and multi-grained similarities, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of S3MA over existing methods. Our code is available at https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.
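
To make the idea of coarse-grained and fine-grained similarities concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify how frame features are pooled or how the two similarities are combined, so the mean pooling, the max over frames, the `weight` parameter, and all function names below are assumptions chosen for illustration. The shared sparse concept space is left out entirely.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def coarse_similarity(frames: Sequence[Vector], text: Vector) -> float:
    """Video-level score: mean-pool the frames (assumption), then compare."""
    dim = len(text)
    video = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return cosine(video, text)


def fine_similarity(frames: Sequence[Vector], text: Vector) -> float:
    """Frame-level score: best-matching single frame (assumption)."""
    return max(cosine(f, text) for f in frames)


def multi_grained_similarity(
    frames: Sequence[Vector], text: Vector, weight: float = 0.5
) -> float:
    """Blend both granularities; `weight` is a hypothetical mixing factor."""
    coarse = coarse_similarity(frames, text)
    fine = fine_similarity(frames, text)
    return weight * coarse + (1.0 - weight) * fine


# Toy example: two 2-D frame embeddings and one text embedding.
frames = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
score = multi_grained_similarity(frames, text)
```

In this toy case the fine-grained score is driven by the single frame that matches the text exactly, while the coarse-grained score is lower because pooling mixes in the unrelated frame. That difference is the motivation for using both granularities.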

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MSR-VTT-1kA | text-to-video R@1 | 49.8 | SuMA (ViT-B/16) |
| Video | MSR-VTT-1kA | text-to-video R@5 | 75.1 | SuMA (ViT-B/16) |
| Video | MSR-VTT-1kA | text-to-video R@10 | 83.9 | SuMA (ViT-B/16) |
| Video | MSR-VTT-1kA | video-to-text R@1 | 47.3 | SuMA (ViT-B/16) |
| Video | MSR-VTT-1kA | video-to-text R@5 | 76 | SuMA (ViT-B/16) |
| Video | MSR-VTT-1kA | video-to-text R@10 | 84.3 | SuMA (ViT-B/16) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 49.8 | SuMA (ViT-B/16) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 75.1 | SuMA (ViT-B/16) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 83.9 | SuMA (ViT-B/16) |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 47.3 | SuMA (ViT-B/16) |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 76 | SuMA (ViT-B/16) |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 84.3 | SuMA (ViT-B/16) |
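
The R@K metrics above are Recall at K: the percentage of queries whose correct match appears among the top K retrieved candidates. The sketch below shows one way to compute it from a query-by-candidate similarity matrix. It assumes exactly one ground-truth candidate per query, located at the same index as the query; the function name and the toy numbers are made up for illustration and are unrelated to the reported results.

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose ground-truth item ranks in the top k.

    Assumes sim[i][j] scores query i against candidate j, and that the
    correct candidate for query i sits at index i (one match per query).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy 3x3 text-to-video similarity matrix (made-up numbers).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video ranked 2nd
    [0.7, 0.6, 0.4],  # query 2: correct video ranked 3rd
]

t2v_r1 = recall_at_k(sim, 1)  # one of three queries hits at K=1
t2v_r3 = recall_at_k(sim, 3)  # all queries hit at K=3

# Video-to-text retrieval uses the transposed matrix.
sim_t = [list(col) for col in zip(*sim)]
v2t_r1 = recall_at_k(sim_t, 1)
```

Transposing the matrix swaps the roles of query and candidate, which is why text-to-video and video-to-text scores can differ for the same model.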

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)