Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen

Published: 2023-03-17 · ICCV 2023 · Tasks: Video Retrieval, Retrieval
Paper · PDF · Code (official)

Abstract

Existing text-video retrieval solutions are, in essence, discriminative models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we tackle the task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates, query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating the joint distribution from noise. During training, DiffusionRet is optimized from both the generation and the discrimination perspectives: the generator is optimized with a generation loss, and the feature extractor is trained with a contrastive loss. In this way, DiffusionRet leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks (MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo) demonstrate the efficacy of our method with superior performance. More encouragingly, without any modification, DiffusionRet also performs well in out-of-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code is available at https://github.com/jpthu17/DiffusionRet.
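The abstract's core argument can be illustrated with a small numeric example. A minimal sketch (all probabilities invented for illustration, not from the paper): a purely discriminative model only sees the conditional p(candidates|query), so a confidently peaked conditional looks identical for common and out-of-distribution queries; the joint p(candidates, query) = p(candidates|query) · p(query) additionally exposes a low marginal p(query) for the out-of-distribution case.

```python
import numpy as np

# Toy joint distribution over 2 queries x 3 candidates (invented numbers).
# Rows: queries, columns: candidates.
joint = np.array([
    [0.400, 0.050, 0.050],  # common (in-distribution) query
    [0.008, 0.001, 0.001],  # rare (out-of-distribution) query
])
joint /= joint.sum()  # normalize to a proper joint p(c, q)

p_query = joint.sum(axis=1)        # marginal p(q)
p_cond = joint / p_query[:, None]  # conditional p(c | q)

# Both queries yield the same peaked conditional...
print(p_cond[0].round(2))  # [0.8 0.1 0.1]
print(p_cond[1].round(2))  # [0.8 0.1 0.1]
# ...but only the joint view reveals that query 1 is unlikely:
print(p_query.round(3))    # [0.98 0.02]
```

A discriminative retriever cannot distinguish the two rows; a model of the joint distribution can flag the second query as out-of-distribution via its marginal.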

Results

All rows correspond to the Video Retrieval task. R@k = Recall@k (%); MdR = Median Rank; MnR = Mean Rank. Missing values are marked "—".

| Dataset | Model | Direction | R@1 | R@5 | R@10 | MdR | MnR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MSR-VTT-1kA | DiffusionRet | text-to-video | 49.0 | 75.2 | 82.7 | 2 | 12.1 |
| MSR-VTT-1kA | DiffusionRet | video-to-text | 47.7 | 73.8 | 84.5 | 2 | 8.8 |
| MSR-VTT-1kA | DiffusionRet+QB-Norm | text-to-video | 48.9 | 75.2 | 83.1 | 2 | 12.1 |
| MSR-VTT-1kA | DiffusionRet+QB-Norm | video-to-text | 49.3 | 74.3 | 83.8 | 2 | 8.5 |
| ActivityNet | DiffusionRet | text-to-video | 45.8 | 75.6 | 86.3 | 2 | 6.5 |
| ActivityNet | DiffusionRet | video-to-text | 43.8 | 75.3 | 86.7 | 2 | 6.3 |
| ActivityNet | DiffusionRet+QB-Norm | text-to-video | 48.1 | — | 85.7 | 2 | 6.8 |
| ActivityNet | DiffusionRet+QB-Norm | video-to-text | 47.4 | 76.3 | 86.7 | 2 | 6.7 |
| DiDeMo | DiffusionRet | text-to-video | 46.7 | 74.7 | 82.7 | 2 | 14.3 |
| DiDeMo | DiffusionRet | video-to-text | 46.2 | 74.3 | 82.2 | 2 | 10.7 |
| DiDeMo | DiffusionRet+QB-Norm | text-to-video | 48.9 | 75.5 | 83.3 | 2 | 14.1 |
| DiDeMo | DiffusionRet+QB-Norm | video-to-text | 50.3 | 75.1 | 82.9 | 1 | 10.3 |
| LSMDC | DiffusionRet | text-to-video | 24.4 | 43.1 | 54.3 | 8 | 40.7 |
| LSMDC | DiffusionRet | video-to-text | 23.0 | 43.5 | 51.5 | 9 | 40.2 |
| MSVD | DiffusionRet | text-to-video | 46.6 | 75.9 | 84.1 | 2 | 15.7 |
| MSVD | DiffusionRet | video-to-text | 61.9 | 88.3 | 92.9 | 1 | 4.5 |
| MSVD | DiffusionRet+QB-Norm | text-to-video | 47.9 | 77.2 | 84.8 | — | 15.6 |
| MSVD | DiffusionRet+QB-Norm | video-to-text | 60.3 | 86.4 | 92.0 | 1 | 4.5 |
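The metrics in the table are standard retrieval metrics and can all be computed from a query-by-candidate similarity matrix in which the ground-truth match for query i is candidate i. A minimal sketch with invented similarity scores (not the paper's evaluation code):

```python
import numpy as np

def retrieval_metrics(sim):
    """sim[i, j]: similarity of query i to candidate j; ground truth is the diagonal."""
    # Candidates sorted by score, descending, for each query.
    order = np.argsort(-sim, axis=1)
    # Rank of the correct candidate for each query (1 = retrieved first).
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    return {
        "R@1": np.mean(ranks <= 1) * 100,
        "R@5": np.mean(ranks <= 5) * 100,
        "R@10": np.mean(ranks <= 10) * 100,
        "MdR": np.median(ranks),
        "MnR": np.mean(ranks),
    }

# Tiny example: 4 queries x 4 candidates, invented scores.
sim = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.2, 0.8, 0.1, 0.7],
    [0.5, 0.6, 0.4, 0.1],  # correct candidate ranked 3rd here
    [0.1, 0.2, 0.3, 0.9],
])
print(retrieval_metrics(sim))  # R@1 = 75.0, MdR = 1.0, MnR = 1.5
```

Higher is better for R@k; lower is better for Median and Mean Rank, which is why the LSMDC rows (low R@k, high MnR) are the weakest in the table above.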

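The "+QB-Norm" rows apply Querybank Normalisation (Bogolin et al., CVPR 2022), a test-time re-scoring step that suppresses "hub" gallery items attracting many queries. A simplified sketch of the idea using a plain inverted softmax over a query bank (the published method uses a dynamic variant; the numbers and bank here are invented):

```python
import numpy as np

def qb_norm(test_sim, bank_sim, beta=20.0):
    """Simplified querybank normalisation via inverted softmax.

    test_sim: (n_test_queries, n_gallery) similarities to re-score.
    bank_sim: (n_bank_queries, n_gallery) similarities of a bank of
              (e.g. training) queries to the same gallery.
    Gallery items that score highly against many bank queries ("hubs")
    get their scores divided down.
    """
    # Per-gallery normaliser: how strongly each gallery item attracts bank queries.
    normaliser = np.exp(beta * bank_sim).sum(axis=0)  # shape (n_gallery,)
    return np.exp(beta * test_sim) / normaliser

# Invented example: gallery item 0 is a "hub" that attracts every bank query.
bank_sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.9, 0.2, 0.2],
])
test_sim = np.array([[0.85, 0.80, 0.1]])  # raw scores favour the hub (item 0)
print(np.argmax(test_sim))                     # 0: hub wins on raw scores
print(np.argmax(qb_norm(test_sim, bank_sim))) # 1: hub suppressed after normalisation
```

This matches the pattern in the table: QB-Norm mostly nudges R@1 and Mean Rank rather than transforming the results, since it only re-ranks queries whose top match was a hub.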
Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
- Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)