Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

Yicheng Xiao, Zhuoyan Luo, Yong Liu, Yue Ma, Hengwei Bian, Yatai Ji, Yujiu Yang, Xiu Li

2023-11-28 · CVPR 2024
Tasks: Video Grounding, Highlight Detection, Contrastive Learning, Moment Retrieval, Retrieval, Temporal Action Localization, Natural Language Moment Retrieval
Links: Paper · PDF · Code (official)

Abstract

Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with transformer-based architectures. However, we observe that the emphases of MR and HD differ: one necessitates the perception of local relationships, while the other prioritizes the understanding of global contexts. Consequently, the lack of task-specific design inevitably limits the ability to capture the intrinsic specialty of the two tasks. To tackle this issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra- and inter-modality interactions across multiple granularities, UVCOM achieves a comprehensive understanding of a video. Moreover, we present multi-aspect contrastive learning to consolidate local relation modeling and global knowledge accumulation via a well-aligned multi-modal space. Extensive experiments on the QVHighlights, Charades-STA, TACoS, YouTube Highlights, and TVSum datasets demonstrate the effectiveness and rationality of UVCOM, which outperforms state-of-the-art methods by a remarkable margin.
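The multi-aspect contrastive learning described in the abstract builds on standard contrastive objectives that pull paired video and text embeddings together in a shared space. As an illustration only, here is a generic symmetric InfoNCE loss in NumPy; this is not the paper's actual loss formulation, and all function and variable names are assumptions:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    A generic contrastive objective: matched (video, text) pairs sit on the
    diagonal of the similarity matrix and serve as positives; all other
    pairs in the batch are negatives. UVCOM's multi-aspect contrastive
    learning builds on this idea but its exact formulation differs.
    """
    # L2-normalize so dot products become cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # positives on the diagonal

    def xent(l):
        # numerically stable log-softmax cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average of video-to-text and text-to-video directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned embeddings the loss approaches zero; shuffling the text side against the video side drives it up, which is the signal that aligns the multi-modal space.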

Results

Task                | Dataset            | Metric       | Value | Model
Video               | TACoS              | R@1, IoU=0.5 | 36.39 | UVCOM
Video               | TACoS              | R@1, IoU=0.7 | 23.32 | UVCOM
Moment Retrieval    | Charades-STA       | R@1, IoU=0.5 | 59.25 | UVCOM
Moment Retrieval    | Charades-STA       | R@1, IoU=0.7 | 36.64 | UVCOM
Moment Retrieval    | QVHighlights       | R@1, IoU=0.5 | 64.53 | UVCOM (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights       | R@1, IoU=0.7 | 48.31 | UVCOM (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights       | mAP          | 43.8  | UVCOM (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights       | mAP@0.5      | 64.78 | UVCOM (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights       | mAP@0.75     | 43.65 | UVCOM (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights       | R@1, IoU=0.5 | 63.55 | UVCOM
Moment Retrieval    | QVHighlights       | R@1, IoU=0.7 | 47.47 | UVCOM
Moment Retrieval    | QVHighlights       | mAP          | 43.18 | UVCOM
Moment Retrieval    | QVHighlights       | mAP@0.5      | 63.37 | UVCOM
Moment Retrieval    | QVHighlights       | mAP@0.75     | 42.67 | UVCOM
Highlight Detection | TVSum              | mAP          | 86.3  | UVCOM (trained from scratch)
Highlight Detection | YouTube Highlights | mAP          | 77.4  | UVCOM
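For reference, the R@1 IoU=t numbers above count a query as correct when the model's top-ranked moment overlaps the ground-truth span with temporal IoU at least t. A minimal sketch of that computation (function names are illustrative, not taken from the UVCOM codebase):

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] moments (e.g., in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_thresh=0.5):
    """Percentage of queries whose top-1 predicted moment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_thresh for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(top1_preds)
```

For example, a prediction of [0, 10] against a ground truth of [5, 15] has IoU 5/15 ≈ 0.33, so it counts at the 0.3 threshold but misses at 0.5; mAP@t averages precision over ranked moment lists in the same IoU-thresholded way.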

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)