WonJun Moon, Sangeek Hyun, Sanguk Park, Dongchan Park, Jae-Pil Heo
Video moment retrieval and highlight detection (MR/HD) have recently drawn attention as the demand for video understanding has increased sharply. The key objective of MR/HD is to localize the moment and to estimate the clip-wise accordance level, i.e., the saliency score, with respect to a given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in the given query. For example, the relevance between the text query and the video content is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Because we observe that the given query plays an insignificant role in transformer architectures, our encoding module starts with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's capability to exploit the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between video-query pairs. Lastly, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building query-dependent representations for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
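
The negative-pair idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the released code: the function names `make_negative_pairs` and `negative_saliency_loss`, the cyclic-shift pairing, and the `-log(1 - s)` penalty are all assumptions chosen for illustration. The sketch pairs each video with a query from another sample in the batch and penalizes high saliency scores on those mismatched pairs.

```python
import math


def make_negative_pairs(videos, queries):
    """Pair each video with the query of the next sample (cyclic shift by one).

    Assumption: queries within a batch describe different content, so the
    shifted pairing is irrelevant. This is a hypothetical pairing scheme.
    """
    if len(videos) != len(queries):
        raise ValueError("videos and queries must have the same length")
    if len(queries) < 2:
        raise ValueError("need at least two samples to form negative pairs")
    shifted = queries[1:] + queries[:1]
    return list(zip(videos, shifted))


def negative_saliency_loss(saliency_scores, eps=1e-8):
    """Mean of -log(1 - s) over clip-wise scores s in [0, 1].

    The value is near zero when all scores are low and grows as the model
    assigns high saliency to clips of an irrelevant video-query pair.
    """
    if not saliency_scores:
        raise ValueError("saliency_scores must be non-empty")
    total = sum(-math.log(1.0 - s + eps) for s in saliency_scores)
    return total / len(saliency_scores)


if __name__ == "__main__":
    pairs = make_negative_pairs(["v0", "v1", "v2"], ["q0", "q1", "q2"])
    print(pairs)
    print(negative_saliency_loss([0.05, 0.10, 0.02]))
    print(negative_saliency_loss([0.90, 0.80, 0.95]))
```

In a training loop, such a term would be added to the losses computed on the original (relevant) pairs, so that the saliency head cannot score clips highly without consulting the query.
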
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | QVHighlights | R@1 IoU=0.5 | 62.4 | QD-DETR |
| Video | QVHighlights | R@1 IoU=0.7 | 44.98 | QD-DETR |
| Video Retrieval | QVHighlights | R@1 IoU=0.5 | 62.4 | QD-DETR |
| Video Retrieval | QVHighlights | R@1 IoU=0.7 | 44.98 | QD-DETR |
| Moment Retrieval | Charades-STA | R@1 IoU=0.5 | 57.31 | QD-DETR (only Video) |
| Moment Retrieval | Charades-STA | R@1 IoU=0.7 | 32.55 | QD-DETR (only Video) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 64.1 | QD-DETR (w/ PT) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 46.1 | QD-DETR (w/ PT) |
| Moment Retrieval | QVHighlights | mAP | 40.62 | QD-DETR (w/ PT) |
| Moment Retrieval | QVHighlights | mAP@0.5 | 64.3 | QD-DETR (w/ PT) |
| Moment Retrieval | QVHighlights | mAP@0.75 | 40.5 | QD-DETR (w/ PT) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 63.06 | QD-DETR (w/ audio) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 45.1 | QD-DETR (w/ audio) |
| Moment Retrieval | QVHighlights | mAP | 40.19 | QD-DETR (w/ audio) |
| Moment Retrieval | QVHighlights | mAP@0.5 | 63.04 | QD-DETR (w/ audio) |
| Moment Retrieval | QVHighlights | mAP@0.75 | 40.1 | QD-DETR (w/ audio) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 63.2 | QD-DETR (only Video w/ PT ASR Captions) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 45.2 | QD-DETR (only Video w/ PT ASR Captions) |
| Moment Retrieval | QVHighlights | mAP | 40 | QD-DETR (only Video w/ PT ASR Captions) |
| Moment Retrieval | QVHighlights | mAP@0.5 | 63.4 | QD-DETR (only Video w/ PT ASR Captions) |
| Moment Retrieval | QVHighlights | mAP@0.75 | 40.4 | QD-DETR (only Video w/ PT ASR Captions) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 62.4 | QD-DETR (only Video) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 44.98 | QD-DETR (only Video) |
| Moment Retrieval | QVHighlights | mAP | 39.86 | QD-DETR (only Video) |
| Moment Retrieval | QVHighlights | mAP@0.5 | 62.52 | QD-DETR (only Video) |
| Moment Retrieval | QVHighlights | mAP@0.75 | 39.88 | QD-DETR (only Video) |
| Highlight Detection | TVSum | mAP | 86.6 | QD-DETR |
| Highlight Detection | TVSum | mAP | 85 | QD-DETR (only Video) |
| Highlight Detection | QVHighlights | Hit@1 | 62.87 | QD-DETR |
| Highlight Detection | QVHighlights | mAP | 39.04 | QD-DETR |
| Highlight Detection | QVHighlights | Hit@1 | 62.4 | QD-DETR (only Video) |
| Highlight Detection | QVHighlights | mAP | 38.94 | QD-DETR (only Video) |
| Highlight Detection | QVHighlights | Hit@1 | 62.27 | QD-DETR (w/ PT) |
| Highlight Detection | QVHighlights | mAP | 38.52 | QD-DETR (w/ PT) |
| Highlight Detection | QVHighlights | Hit@1 | 61.91 | QD-DETR (only Video w/ PT) |
| Video Grounding | QVHighlights | R@1 IoU=0.5 | 62.4 | QD-DETR |
| Video Grounding | QVHighlights | R@1 IoU=0.7 | 44.98 | QD-DETR |
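
For readers unfamiliar with the moment-retrieval metrics in the table, the following sketch shows how R@1 at an IoU threshold is commonly computed. It makes simplifying assumptions that may differ from the official benchmark scripts: each query has a single ground-truth moment, moments are `(start, end)` pairs in seconds, and a prediction counts as a hit when its IoU is greater than or equal to the threshold. The function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_predictions, ground_truths, iou_threshold):
    """Percentage of queries whose top-ranked moment reaches the IoU threshold."""
    if len(top1_predictions) != len(ground_truths):
        raise ValueError("need one prediction per ground-truth moment")
    if not ground_truths:
        raise ValueError("no queries to evaluate")
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (2.0, 8.0), (0.0, 4.0)]
    targets = [(0.0, 10.0), (0.0, 10.0), (6.0, 10.0)]
    print(recall_at_1(predictions, targets, iou_threshold=0.5))
    print(recall_at_1(predictions, targets, iou_threshold=0.7))
```

The stricter IoU=0.7 threshold requires tighter temporal alignment, which is why each model's R@1 IoU=0.7 value in the table is lower than its R@1 IoU=0.5 value.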