Jie Lei, Tamara L. Berg, Mohit Bansal
Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHIGHLIGHTS) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Moment Retrieval | Charades-STA | R@1 IoU=0.5 | 55.65 | Moment-DETR w/ PT (on 10K HowTo100M videos) |
| Moment Retrieval | Charades-STA | R@1 IoU=0.7 | 34.17 | Moment-DETR w/ PT (on 10K HowTo100M videos) |
| Moment Retrieval | Charades-STA | R@1 IoU=0.5 | 53.63 | Moment-DETR |
| Moment Retrieval | Charades-STA | R@1 IoU=0.7 | 31.37 | Moment-DETR |
| Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 59.78 | Moment-DETR (w/ PT ASR Captions) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 40.33 | Moment-DETR (w/ PT ASR Captions) |
| Moment Retrieval | QVHighlights | mAP | 36.14 | Moment-DETR (w/ PT ASR Captions) |
| Moment Retrieval | QVHighlights | mAP@0.5 | 60.51 | Moment-DETR (w/ PT ASR Captions) |
| Moment Retrieval | QVHighlights | mAP@0.75 | 35.36 | Moment-DETR (w/ PT ASR Captions) |
| Highlight Detection | QVHighlights | Hit@1 | 60.17 | Moment-DETR w/ PT |
| Highlight Detection | QVHighlights | mAP | 37.43 | Moment-DETR w/ PT |
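
The R@1 IoU=0.5 and R@1 IoU=0.7 rows above are commonly read as the percentage of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal IoU at or above the threshold. The following is a minimal sketch of that reading, not the official evaluation script: it assumes exactly one ground-truth moment per query, and the names `temporal_iou` and `recall_at_1` as well as the toy moments are hypothetical.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair; this representation is an assumption.
Moment = Tuple[float, float]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two 1-D time spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Moment], gts: List[Moment],
                iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumes one ground-truth moment per query (a simplification).
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one prediction per ground-truth moment")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Toy data for illustration only; these are not dataset values.
preds = [(10.0, 20.0), (0.0, 6.0), (30.0, 40.0)]
gts = [(12.0, 20.0), (0.0, 10.0), (50.0, 60.0)]

r1_at_05 = recall_at_1(preds, gts, 0.5)  # two of three queries count as hits
r1_at_07 = recall_at_1(preds, gts, 0.7)  # only the first query counts as a hit
```

Raising the threshold from 0.5 to 0.7 can only remove hits, which is consistent with the IoU=0.7 values in the table being lower than the IoU=0.5 values for the same model.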