Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang

2022-12-31 · CVPR 2023 · Video Retrieval · Data Augmentation · Video Captioning · Retrieval
Paper · PDF · Code (official)

Abstract

Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences. However, in real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This insight has motivated us to propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning with knowledge from web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated captions, a natural question arises: what benefits do they bring to text-video retrieval? To answer this, we introduce Cap4Video, a new framework that leverages captions in three ways: i) Input data: video-caption pairs can augment the training data. ii) Intermediate feature interaction: we perform cross-modal feature interaction between the video and caption to produce enhanced video representations. iii) Output score: the Query-Caption matching branch can complement the original Query-Video matching branch for text-video retrieval. We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach. Without any post-processing, Cap4Video achieves state-of-the-art performance on four standard text-video retrieval benchmarks: MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is available at https://github.com/whwu95/Cap4Video .
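The third use of captions described in the abstract is score-level fusion: the final text-video similarity combines the Query-Video and Query-Caption matching branches. A minimal sketch of that idea, assuming cosine similarity over precomputed embeddings and a hypothetical fusion weight `alpha` (the paper's exact fusion scheme may differ):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize rows so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fused_similarity(query_emb, video_emb, caption_emb, alpha=0.5):
    """Fuse Query-Video and Query-Caption matching scores.

    query_emb:   (num_queries, d) text-query embeddings
    video_emb:   (num_videos, d)  video embeddings
    caption_emb: (num_videos, d)  embeddings of auto-generated captions
    alpha:       fusion weight (hypothetical; a tunable hyperparameter here)
    """
    q = l2_normalize(query_emb)
    v = l2_normalize(video_emb)
    c = l2_normalize(caption_emb)
    sim_qv = q @ v.T            # Query-Video branch
    sim_qc = q @ c.T            # Query-Caption branch
    return alpha * sim_qv + (1 - alpha) * sim_qc

rng = np.random.default_rng(0)
S = fused_similarity(rng.normal(size=(4, 8)),
                     rng.normal(size=(6, 8)),
                     rng.normal(size=(6, 8)))
print(S.shape)  # (4, 6): one fused score per (query, video) pair
```

Because each branch is a cosine similarity in [-1, 1], the weighted average stays in the same range, so the two branches contribute on a comparable scale.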

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 51.4 | Cap4Video
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 75.7 | Cap4Video
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 83.9 | Cap4Video
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 1 | Cap4Video
Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 12.4 | Cap4Video
Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 49 | Cap4Video
Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 75.2 | Cap4Video
Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 85 | Cap4Video
Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 2 | Cap4Video
Video Retrieval | MSR-VTT-1kA | video-to-text Mean Rank | 8 | Cap4Video
Video Retrieval | VATEX | text-to-video R@1 | 66.6 | Cap4Video
Video Retrieval | VATEX | text-to-video R@5 | 93.1 | Cap4Video
Video Retrieval | VATEX | text-to-video R@10 | 97 | Cap4Video
Video Retrieval | VATEX | text-to-video Median Rank | 1 | Cap4Video
Video Retrieval | VATEX | text-to-video Mean Rank | 2.7 | Cap4Video
Video Retrieval | VATEX | video-to-text R@1 | 80.9 | Cap4Video
Video Retrieval | VATEX | video-to-text R@10 | 99.6 | Cap4Video
Video Retrieval | DiDeMo | text-to-video R@1 | 52 | Cap4Video
Video Retrieval | DiDeMo | text-to-video R@5 | 79.4 | Cap4Video
Video Retrieval | DiDeMo | text-to-video R@10 | 87.5 | Cap4Video
Video Retrieval | DiDeMo | text-to-video Median Rank | 1 | Cap4Video
Video Retrieval | DiDeMo | text-to-video Mean Rank | 10.5 | Cap4Video
Video Retrieval | DiDeMo | video-to-text R@1 | 51.2 | Cap4Video
Video Retrieval | DiDeMo | video-to-text R@5 | 78.5 | Cap4Video
Video Retrieval | DiDeMo | video-to-text R@10 | 87.4 | Cap4Video
Video Retrieval | DiDeMo | video-to-text Median Rank | 1 | Cap4Video
Video Retrieval | DiDeMo | video-to-text Mean Rank | 7.3 | Cap4Video
Video Retrieval | MSVD | text-to-video R@1 | 51.8 | Cap4Video
Video Retrieval | MSVD | text-to-video R@5 | 80.8 | Cap4Video
Video Retrieval | MSVD | text-to-video R@10 | 88.3 | Cap4Video
Video Retrieval | MSVD | text-to-video Median Rank | 1 | Cap4Video
Video Retrieval | MSVD | text-to-video Mean Rank | 8.3 | Cap4Video
Video Retrieval | MSVD | video-to-text R@1 | 70 | Cap4Video
Video Retrieval | MSVD | video-to-text R@5 | 93.2 | Cap4Video
Video Retrieval | MSVD | video-to-text R@10 | 96.2 | Cap4Video
Video Retrieval | MSVD | video-to-text Median Rank | 1 | Cap4Video
Video Retrieval | MSVD | video-to-text Mean Rank | 2.4 | Cap4Video
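All of the metrics above (R@K, Median Rank, Mean Rank) can be derived from a single text-to-video similarity matrix. A sketch of the standard computation, assuming query i's ground-truth video is at index i (the usual paired-benchmark convention, not code from the Cap4Video repository):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@K / Median Rank / Mean Rank from a (queries x videos)
    similarity matrix where query i's correct video is video i."""
    order = np.argsort(-sim, axis=1)  # highest-scoring video first
    # 1-based rank of the ground-truth video for each query
    targets = np.arange(sim.shape[0])[:, None]
    ranks = 1 + np.argmax(order == targets, axis=1)
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MedianRank": float(np.median(ranks)),
        "MeanRank": float(np.mean(ranks)),
    }

# Identity similarity: every query ranks its own video first,
# so all R@K are 100.0 and both rank statistics are 1.0.
metrics = retrieval_metrics(np.eye(5))
```

R@K is reported as a percentage (the fraction of queries whose correct video appears in the top K), which matches how the table values above are expressed.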

Related Papers

Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)