Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Yingsen Zeng, Yujie Zhong, Chengjian Feng, Lin Ma

Published: 2024-04-07. Tasks: Action Detection, Moment Queries, Moment Retrieval, Natural Language Queries, Temporal Action Localization, Natural Language Moment Retrieval.
Links: Paper · PDF · Code (official)

Abstract

Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify events described by open-ended natural language within untrimmed videos. Although the two tasks focus on different kinds of events, we observe a significant connection between them; for instance, most descriptions in MR involve multiple actions from TAD. In this paper, we investigate the potential synergy between TAD and MR. First, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD and events for MR, into a common embedding space, and uses two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments. Second, we explore the efficacy of two task-fusion learning approaches, pre-training and co-training, to enhance the mutual benefit between TAD and MR. Extensive experiments demonstrate that the proposed task-fusion learning scheme enables the two tasks to help each other and to outperform their separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.
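The abstract's core idea (embed an action label or a sentence into one query space, then decode query-conditioned scores and segments) lends itself to a compact sketch. The following is a minimal, illustrative PyTorch rendering, not the authors' implementation: the module names, dimensions, and the elementwise fusion operator are all our assumptions; see the official repository for the real architecture.

```python
import torch
import torch.nn as nn

class UniMDSketch(nn.Module):
    """Illustrative sketch of the idea in the abstract: one model serves
    both TAD and MR by treating the action label or the sentence as a
    query embedding, then decoding query-conditioned scores and segments.
    Dimensions, layer choices, and the fusion operator are assumptions."""

    def __init__(self, video_dim=512, query_dim=512, hidden=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.query_proj = nn.Linear(query_dim, hidden)
        # Two query-dependent heads: one classifies each timestep's
        # relevance to the query, one regresses (start, end) offsets.
        self.cls_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.reg_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, video_feats, query_emb):
        # video_feats: (T, video_dim) clip features; query_emb: (query_dim,)
        # produced by a shared text encoder from either an action name
        # (TAD) or a free-form event description (MR).
        v = self.video_proj(video_feats)           # (T, hidden)
        q = self.query_proj(query_emb)             # (hidden,)
        fused = v * q                              # query-conditioned features
        scores = self.cls_head(fused).squeeze(-1)  # per-timestep score, (T,)
        segments = self.reg_head(fused)            # per-timestep (start, end), (T, 2)
        return scores, segments

model = UniMDSketch()
video = torch.randn(128, 512)            # 128 video clip features (toy values)
query = torch.randn(512)                 # embedded action name or sentence
scores, segments = model(video, query)   # shapes: (128,), (128, 2)
```

Under the co-training scheme described above, batches from both tasks would pass through this same model; the only task-specific difference is where the query embedding comes from.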

Results

Task                         | Dataset              | Metric              | Value | Model
Video                        | ActivityNet-1.3      | mAP                 | 39.83 | UniMD+Sync.
Video                        | ActivityNet-1.3      | mAP, IoU@0.5        | 60.29 | UniMD+Sync.
Video                        | ActivityNet Captions | R@5, IoU=0.5        | 80.54 | UniMD+Sync.
Video                        | ActivityNet Captions | R@5, IoU=0.7        | 57.04 | UniMD+Sync.
Temporal Action Localization | ActivityNet-1.3      | mAP                 | 39.83 | UniMD+Sync.
Temporal Action Localization | ActivityNet-1.3      | mAP, IoU@0.5        | 60.29 | UniMD+Sync.
Zero-Shot Learning           | ActivityNet-1.3      | mAP                 | 39.83 | UniMD+Sync.
Zero-Shot Learning           | ActivityNet-1.3      | mAP, IoU@0.5        | 60.29 | UniMD+Sync.
Action Localization          | ActivityNet-1.3      | mAP                 | 39.83 | UniMD+Sync.
Action Localization          | ActivityNet-1.3      | mAP, IoU@0.5        | 60.29 | UniMD+Sync.
Action Detection             | Charades             | mAP                 | 26.53 | UniMD+Sync. (RGB+Flow)
Moment Retrieval             | Charades-STA         | R@1, IoU=0.5        | 63.98 | UniMD+Sync.
Moment Retrieval             | Charades-STA         | R@1, IoU=0.7        | 44.46 | UniMD+Sync.
Moment Retrieval             | Charades-STA         | R@5, IoU=0.5        | 91.94 | UniMD+Sync.
Moment Retrieval             | Charades-STA         | R@5, IoU=0.7        | 67.72 | UniMD+Sync.
Natural Language Queries     | Ego4D                | R@1, IoU=0.3        | 14.16 | UniMD+Sync.
Natural Language Queries     | Ego4D                | R@1, IoU=0.5        | 10.06 | UniMD+Sync.
Natural Language Queries     | Ego4D                | R@1, mean(0.3, 0.5) | 12.11 | UniMD+Sync.
Natural Language Queries     | Ego4D                | R@5, IoU=0.3        | 26.95 | UniMD+Sync.
Natural Language Queries     | Ego4D                | R@5, IoU=0.5        | 19.16 | UniMD+Sync.
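For reference, the R@K, IoU=θ entries above follow the standard moment-retrieval recall protocol: a query counts as a hit if any of the model's top-K predicted segments overlaps the ground-truth moment with temporal IoU at least θ, and the reported value is the hit rate averaged over all queries. A minimal sketch of that computation (helper names are ours, not from the paper's codebase):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds.
    Disjoint segments have zero intersection, so the result is 0."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_segments, gt, k, iou_thr):
    """R@K, IoU=iou_thr for a single query: 1.0 if any of the top-k
    predicted segments overlaps the ground truth above the threshold."""
    return float(any(temporal_iou(s, gt) >= iou_thr for s in ranked_segments[:k]))

# Toy example: the model's best-first segment list for one query.
preds = [(10.2, 15.0), (3.0, 4.5), (40.0, 42.0)]
gt = (9.8, 14.5)
print(recall_at_k(preds, gt, k=1, iou_thr=0.5))  # 1.0 (IoU is about 0.83)
# Averaging these 0/1 hits over every query in the test split yields
# numbers such as the R@1, IoU=0.5 entries in the table above.
```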

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding (2025-06-27)
CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment (2025-06-25)
MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans (2025-06-25)
A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs (2025-06-25)
Towards Probabilistic Question Answering Over Tabular Data (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Distributed Activity Detection for Cell-Free Hybrid Near-Far Field Communications (2025-06-17)