Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection

Henghao Zhao, Kevin Qinghong Lin, Rui Yan, Zechao Li

2023-08-29
Denoising, Video Grounding, Highlight Detection, Moment Retrieval, Retrieval, Object Detection

Abstract

Video moment retrieval and highlight detection have received attention in the current era of video content proliferation. The two tasks aim to localize moments and estimate clip relevance for a user-specified query. Because video content is continuous in time, temporal events in a video often lack clear boundaries. This boundary ambiguity makes it hard for a model to learn text-video clip correspondences, so existing methods perform poorly when predicting target segments. To alleviate this problem, we propose to solve the two tasks jointly from the perspective of denoising generation, so that the target boundary can be localized clearly through iterative refinement from coarse to fine. Specifically, we propose a novel framework, DiffusionVMR, which uses a diffusion model to recast the two tasks as a unified conditional denoising generation process. During training, Gaussian noise is added to corrupt the ground truth, and the resulting noisy candidates serve as input. The model is trained to reverse this noise-addition process. At inference, DiffusionVMR starts directly from Gaussian noise and progressively refines the proposals into meaningful output. Notably, DiffusionVMR inherits the ability of diffusion models to refine results iteratively during inference, which sharpens the boundary transition from coarse to fine. Furthermore, training and inference in DiffusionVMR are decoupled: an arbitrary setting can be used at inference, and it need not match the setting used in training. Extensive experiments on five widely used benchmarks (QVHighlights, Charades-STA, TACoS, YouTubeHighlights, and TVSum) across two tasks (moment retrieval and/or highlight detection) demonstrate the effectiveness and flexibility of DiffusionVMR.
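The training-time corruption described in the abstract can be illustrated with a short sketch. It is not the authors' implementation: the cosine noise schedule, the (center, width) span encoding, the `scale` factor, and the function names `cosine_alpha_bar` and `corrupt_spans` are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps.

    Assumption: a cosine schedule; the abstract does not name one.
    """
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def corrupt_spans(spans, t, alpha_bar, rng, scale=2.0):
    """Add Gaussian noise to ground-truth spans at diffusion step t.

    spans: array of shape (N, 2) holding (center, width), each in [0, 1].
    The encoding and the scale factor are hypothetical choices.
    """
    x0 = (spans * 2.0 - 1.0) * scale              # map [0, 1] to [-scale, scale]
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    xt = np.clip(xt, -scale, scale)               # keep values in a bounded range
    return (xt / scale + 1.0) / 2.0               # map back to [0, 1]

rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(num_steps=1000)
gt_spans = np.array([[0.30, 0.10], [0.75, 0.20]])  # made-up example moments

slightly_noisy = corrupt_spans(gt_spans, t=10, alpha_bar=alpha_bar, rng=rng)
mostly_noise = corrupt_spans(gt_spans, t=900, alpha_bar=alpha_bar, rng=rng)
```

At small `t` the corrupted spans stay close to the ground truth, and at large `t` they are dominated by noise. A model trained to undo this corruption can then, in principle, start from pure noise at inference and refine its proposals step by step, as the abstract describes.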

Results

Task             Dataset       Metric         Value  Model
Video            QVHighlights  R@1, IoU=0.5   61.61  DiffusionVMR
Video            QVHighlights  R@1, IoU=0.7   44.49  DiffusionVMR
Video Retrieval  QVHighlights  R@1, IoU=0.5   61.61  DiffusionVMR
Video Retrieval  QVHighlights  R@1, IoU=0.7   44.49  DiffusionVMR
Video Grounding  QVHighlights  R@1, IoU=0.5   61.61  DiffusionVMR
Video Grounding  QVHighlights  R@1, IoU=0.7   44.49  DiffusionVMR
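
For readers unfamiliar with the metric, the following sketch shows one common way to compute R@1 at a temporal IoU threshold. It assumes one ground-truth moment per query and counts a hit when the IoU is at or above the threshold. The function names and the example values are hypothetical, and the official benchmark evaluation may differ in details such as how multiple ground-truth moments are handled.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_threshold):
    """Percentage of queries whose top-ranked prediction reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)

# Made-up predictions and ground truths for three queries.
top1_preds = [(10.0, 20.0), (0.0, 5.0), (30.0, 50.0)]
gts = [(12.0, 22.0), (10.0, 20.0), (30.0, 45.0)]

r1_at_05 = recall_at_1(top1_preds, gts, iou_threshold=0.5)  # 2 of 3 queries hit
r1_at_07 = recall_at_1(top1_preds, gts, iou_threshold=0.7)  # 1 of 3 queries hit
```

The stricter threshold can only keep or lower the score, which is consistent with the table reporting a lower value at IoU=0.7 than at IoU=0.5.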

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)