Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

Jie Lei, Tamara L. Berg, Mohit Bansal

2021-07-20 · Highlight Detection · Moment Retrieval · Retrieval · Natural Language Queries
Paper · PDF · Code (official)

Abstract

Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr
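
To make the set-prediction formulation concrete, the sketch below shows a Moment-DETR-style forward pass in PyTorch: projected video clip features and query token features are concatenated, encoded jointly, and a fixed set of learned moment queries is decoded into normalized (center, width) spans plus per-clip saliency scores. This is a minimal illustrative sketch, not the authors' implementation; the class name, feature dimensions, layer counts, and number of moment queries are assumptions, and the official model lives in the linked repository.

```python
# Illustrative Moment-DETR-style sketch (not the official code).
import torch
import torch.nn as nn

class MomentDETRSketch(nn.Module):
    def __init__(self, vid_dim=2304, txt_dim=512, d_model=256, num_moment_queries=10):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, d_model)   # project clip features
        self.txt_proj = nn.Linear(txt_dim, d_model)   # project query token features
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.moment_queries = nn.Embedding(num_moment_queries, d_model)
        self.span_head = nn.Linear(d_model, 2)        # normalized (center, width) per query
        self.saliency_head = nn.Linear(d_model, 1)    # saliency score per video clip

    def forward(self, vid_feats, txt_feats):
        # vid_feats: (B, num_clips, vid_dim); txt_feats: (B, num_tokens, txt_dim)
        vid = self.vid_proj(vid_feats)
        txt = self.txt_proj(txt_feats)
        src = torch.cat([vid, txt], dim=1)            # joint video+query sequence
        memory = self.transformer.encoder(src)
        queries = self.moment_queries.weight.unsqueeze(0).expand(vid.size(0), -1, -1)
        hs = self.transformer.decoder(queries, memory)
        spans = self.span_head(hs).sigmoid()          # one candidate moment per query slot
        saliency = self.saliency_head(memory[:, : vid.size(1)]).squeeze(-1)
        return spans, saliency

# Usage with random features (dimensions are illustrative)
model = MomentDETRSketch()
spans, saliency = model(torch.randn(2, 75, 2304), torch.randn(2, 20, 512))
print(spans.shape, saliency.shape)  # torch.Size([2, 10, 2]) torch.Size([2, 75])
```

As in DETR, training treats the predicted spans as an unordered set: they are paired with ground-truth moments via bipartite matching before the localization and classification losses are computed, which is what lets the model predict moments end-to-end without hand-designed proposals.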

Results

Task                | Dataset      | Metric      | Value | Model
Moment Retrieval    | Charades-STA | R@1 IoU=0.5 | 55.65 | Moment-DETR w/ PT (on 10K HowTo100M videos)
Moment Retrieval    | Charades-STA | R@1 IoU=0.7 | 34.17 | Moment-DETR w/ PT (on 10K HowTo100M videos)
Moment Retrieval    | Charades-STA | R@1 IoU=0.5 | 53.63 | Moment-DETR
Moment Retrieval    | Charades-STA | R@1 IoU=0.7 | 31.37 | Moment-DETR
Moment Retrieval    | QVHighlights | R@1 IoU=0.5 | 59.78 | Moment-DETR (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights | R@1 IoU=0.7 | 40.33 | Moment-DETR (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights | mAP         | 36.14 | Moment-DETR (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights | mAP@0.5     | 60.51 | Moment-DETR (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights | mAP@0.75    | 35.36 | Moment-DETR (w/ PT ASR Captions)
Highlight Detection | QVHighlights | Hit@1       | 60.17 | Moment-DETR w/ PT
Highlight Detection | QVHighlights | mAP         | 37.43 | Moment-DETR w/ PT
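
For context on the R@1 IoU=0.5/0.7 numbers above, the snippet below sketches how recall at a temporal IoU threshold is commonly computed for moment retrieval: the top-ranked predicted span counts as a hit when its IoU with the ground-truth moment meets the threshold (for queries with multiple ground-truth windows, evaluation typically scores against the best-matching one). This is an illustrative sketch with hypothetical helper names, not the official evaluation code from the repository.

```python
# Illustrative moment-retrieval metric sketch (not the official evaluation script).

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) spans, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.5):
    """predictions / ground_truths: one (start, end) span per query, in order."""
    hits = sum(
        temporal_iou(pred, gt) >= threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Single-query example: IoU = 13 / 16 = 0.8125 >= 0.5, so R@1 IoU=0.5 is 1.0
print(recall_at_1([(10.0, 25.0)], [(12.0, 26.0)], threshold=0.5))
```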

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)