VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman

2024-12-02 · Highlight Detection · Moment Retrieval
Paper · PDF · Code (official)

Abstract

Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint-prediction transformer models often overlook cross-task dynamics and video-text alignment and refinement. Moreover, most models use limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal performance in capturing the interdependence between the video and text modalities. Although large language models and large vision-language models (LLMs/LVLMs) have gained prominence across many domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework that addresses these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled, query-aware clip representations, and (iii) a uni-directional joint-task feedback mechanism that enhances both tasks through their correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) we leverage LVLMs such as BLIP-2 for enhanced multimodal feature integration and intelligent pretraining on synthetic data generated by LVLMs. Comprehensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Code and models are available at https://github.com/dpaul06/VideoLights.
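
Of these components, the bi-directional fusion in (ii) is what replaces the limited, uni-directional attention the abstract criticizes: video clips attend over query tokens and query tokens attend over video clips. The PyTorch sketch below illustrates that general pattern only; the class name, dimensions, and residual layout are assumptions of mine, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class BiDirectionalCrossModalFusion(nn.Module):
    """Hypothetical sketch of bi-directional cross-modal attention:
    each modality's representation is conditioned on the other."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Two cross-attention blocks, one per direction.
        self.vid_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_vid = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, text: torch.Tensor):
        # video: (B, num_clips, dim), text: (B, num_tokens, dim)
        # Query-aware clip features: clips attend over text tokens.
        v_fused, _ = self.vid_to_txt(query=video, key=text, value=text)
        # Clip-aware query features: tokens attend over video clips.
        t_fused, _ = self.txt_to_vid(query=text, key=video, value=video)
        # Residual connections keep the original unimodal signal.
        return self.norm_v(video + v_fused), self.norm_t(text + t_fused)

if __name__ == "__main__":
    fusion = BiDirectionalCrossModalFusion()
    clips = torch.randn(2, 75, 256)   # e.g. 75 two-second clips
    tokens = torch.randn(2, 20, 256)  # e.g. 20 query tokens
    v, t = fusion(clips, tokens)
    print(v.shape, t.shape)  # (2, 75, 256) and (2, 20, 256)
```

The point of running attention in both directions is that the fused clip features are query-aware (useful for scoring moments against the text) while the fused token features are video-aware, which a single uni-directional pass cannot provide.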

Results

Task                | Dataset      | Metric      | Value | Model
Moment Retrieval    | Charades-STA | R@1 IoU=0.3 | 73.33 | VideoLights-B-pt
Moment Retrieval    | Charades-STA | R@1 IoU=0.5 | 61.96 | VideoLights-B-pt
Moment Retrieval    | Charades-STA | R@1 IoU=0.7 | 41.05 | VideoLights-B-pt
Moment Retrieval    | Charades-STA | mIoU        | 52.94 | VideoLights-B-pt
Moment Retrieval    | QVHighlights | R@1 IoU=0.5 | 70.36 | VideoLights-B-pt
Moment Retrieval    | QVHighlights | R@1 IoU=0.7 | 55.25 | VideoLights-B-pt
Moment Retrieval    | QVHighlights | mAP         | 47.94 | VideoLights-B-pt
Moment Retrieval    | QVHighlights | mAP@0.5     | 69.53 | VideoLights-B-pt
Moment Retrieval    | QVHighlights | mAP@0.75    | 49.17 | VideoLights-B-pt
Highlight Detection | QVHighlights | Hit@1       | 70.56 | VideoLights-B-pt
Highlight Detection | QVHighlights | mAP         | 42.84 | VideoLights-B-pt
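
A note on reading the Moment Retrieval metrics: R@1 IoU=θ is the percentage of queries whose single top-ranked predicted moment overlaps the ground-truth span with temporal IoU of at least θ, mIoU averages that overlap across queries, and mAP averages precision over multiple IoU thresholds. A minimal sketch of the underlying IoU computation is below; the helper names are my own, not from the VideoLights evaluation code.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, ground_truths, threshold: float = 0.5) -> float:
    """Percentage of queries whose top-ranked moment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(top1_preds, ground_truths))
    return 100.0 * hits / len(ground_truths)

# A top-1 prediction of [10 s, 25 s] against a ground truth of [12 s, 30 s]
# has IoU = 13 / 20 = 0.65: a hit at IoU=0.5 but a miss at IoU=0.7.
print(temporal_iou((10.0, 25.0), (12.0, 30.0)))                   # 0.65
print(recall_at_1([(10.0, 25.0)], [(12.0, 30.0)], threshold=0.7))  # 0.0
```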

Related Papers

DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding (2025-06-16)
Unsupervised Transcript-assisted Video Summarization and Highlight Detection (2025-05-29)
Rhapsody: A Dataset for Highlight Detection in Podcasts (2025-05-26)
Gameplay Highlights Generation (2025-05-12)
Retrieval Augmented Generation Evaluation for Health Documents (2025-05-07)
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action (2025-05-02)
Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection (2025-04-20)
Automatic Detection of Intro and Credits in Video using CLIP and Multihead Attention (2025-04-13)