Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Yingsen Zeng, Yujie Zhong, Chengjian Feng, Lin Ma

Published: 2024-04-07. Tasks: Action Detection, Moment Queries, Moment Retrieval, Natural Language Queries, Temporal Action Localization, Natural Language Moment Retrieval.
Links: Paper · PDF · Code (official)

Abstract

Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify events described by open-ended natural language within untrimmed videos. Although the two tasks focus on different kinds of events, we observe a significant connection between them; for instance, most descriptions in MR involve multiple actions from TAD. In this paper, we investigate the potential synergy between TAD and MR. First, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD and events for MR, into a common embedding space, and uses two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments. Second, we explore the efficacy of two task-fusion learning approaches, pre-training and co-training, to enhance the mutual benefit between TAD and MR. Extensive experiments demonstrate that the proposed task-fusion learning scheme enables the two tasks to help each other and to outperform their separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.
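The abstract's core idea (embed an action label or a sentence into one query space, then decode query-conditioned scores and segments) lends itself to a compact sketch. The following is a minimal, illustrative PyTorch rendering, not the authors' implementation: the module names, dimensions, and the elementwise fusion operator are all our assumptions; see the official repository for the real architecture.

```python
import torch
import torch.nn as nn

class UniMDSketch(nn.Module):
    """Illustrative sketch of the idea in the abstract: one model serves
    both TAD and MR by treating the action label or the sentence as a
    query embedding, then decoding query-conditioned scores and segments.
    Dimensions, layer choices, and the fusion operator are assumptions."""

    def __init__(self, video_dim=512, query_dim=512, hidden=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.query_proj = nn.Linear(query_dim, hidden)
        # Two query-dependent heads: one classifies each timestep's
        # relevance to the query, one regresses (start, end) offsets.
        self.cls_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.reg_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, video_feats, query_emb):
        # video_feats: (T, video_dim) clip features; query_emb: (query_dim,)
        # produced by a shared text encoder from either an action name
        # (TAD) or a free-form event description (MR).
        v = self.video_proj(video_feats)           # (T, hidden)
        q = self.query_proj(query_emb)             # (hidden,)
        fused = v * q                              # query-conditioned features
        scores = self.cls_head(fused).squeeze(-1)  # per-timestep score, (T,)
        segments = self.reg_head(fused)            # per-timestep (start, end), (T, 2)
        return scores, segments

model = UniMDSketch()
video = torch.randn(128, 512)            # 128 video clip features (toy values)
query = torch.randn(512)                 # embedded action name or sentence
scores, segments = model(video, query)   # shapes: (128,), (128, 2)
```

Under the co-training scheme described above, batches from both tasks would pass through this same model; the only task-specific difference is where the query embedding comes from.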

Results

Task                         | Dataset              | Metric              | Value | Model
Video                        | ActivityNet-1.3      | mAP                 | 39.83 | UniMD+Sync.
Video                        | ActivityNet-1.3      | mAP, IoU@0.5        | 60.29 | UniMD+Sync.
Video                        | ActivityNet Captions | R@5, IoU=0.5        | 80.54 | UniMD+Sync.
Video                        | ActivityNet Captions | R@5, IoU=0.7        | 57.04 | UniMD+Sync.
Temporal Action Localization | ActivityNet-1.3      | mAP                 | 39.83 | UniMD+Sync.
Temporal Action Localization | ActivityNet-1.3      | mAP, IoU@0.5        | 60.29 | UniMD+Sync.
Zero-Shot Learning           | ActivityNet-1.3      | mAP                 | 39.83 | UniMD+Sync.
Zero-Shot Learning           | ActivityNet-1.3      | mAP, IoU@0.5        | 60.29 | UniMD+Sync.
Action Localization          | ActivityNet-1.3      | mAP                 | 39.83 | UniMD+Sync.
Action Localization          | ActivityNet-1.3      | mAP, IoU@0.5        | 60.29 | UniMD+Sync.
Action Detection             | Charades             | mAP                 | 26.53 | UniMD+Sync. (RGB+Flow)
Moment Retrieval             | Charades-STA         | R@1, IoU=0.5        | 63.98 | UniMD+Sync.
Moment Retrieval             | Charades-STA         | R@1, IoU=0.7        | 44.46 | UniMD+Sync.
Moment Retrieval             | Charades-STA         | R@5, IoU=0.5        | 91.94 | UniMD+Sync.
Moment Retrieval             | Charades-STA         | R@5, IoU=0.7        | 67.72 | UniMD+Sync.
Natural Language Queries     | Ego4D                | R@1, IoU=0.3        | 14.16 | UniMD+Sync.
Natural Language Queries     | Ego4D                | R@1, IoU=0.5        | 10.06 | UniMD+Sync.
Natural Language Queries     | Ego4D                | R@1, mean(0.3, 0.5) | 12.11 | UniMD+Sync.
Natural Language Queries     | Ego4D                | R@5, IoU=0.3        | 26.95 | UniMD+Sync.
Natural Language Queries     | Ego4D                | R@5, IoU=0.5        | 19.16 | UniMD+Sync.
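For reference, the R@K, IoU=θ entries above follow the standard moment-retrieval recall protocol: a query counts as a hit if any of the model's top-K predicted segments overlaps the ground-truth moment with temporal IoU at least θ, and the reported value is the hit rate averaged over all queries. A minimal sketch of that computation (helper names are ours, not from the paper's codebase):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds.
    Disjoint segments have zero intersection, so the result is 0."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_segments, gt, k, iou_thr):
    """R@K, IoU=iou_thr for a single query: 1.0 if any of the top-k
    predicted segments overlaps the ground truth above the threshold."""
    return float(any(temporal_iou(s, gt) >= iou_thr for s in ranked_segments[:k]))

# Toy example: the model's best-first segment list for one query.
preds = [(10.2, 15.0), (3.0, 4.5), (40.0, 42.0)]
gt = (9.8, 14.5)
print(recall_at_k(preds, gt, k=1, iou_thr=0.5))  # 1.0 (IoU is about 0.83)
# Averaging these 0/1 hits over every query in the test split yields
# numbers such as the R@1, IoU=0.5 entries in the table above.
```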

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding (2025-06-27)
CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment (2025-06-25)
MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans (2025-06-25)
A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs (2025-06-25)
Towards Probabilistic Question Answering Over Tabular Data (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Distributed Activity Detection for Cell-Free Hybrid Near-Far Field Communications (2025-06-17)