Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Tanveer Hannan, Md Mohaiminul Islam, Thomas Seidl, Gedas Bertasius

2023-12-11 · Video-Text Retrieval · Text Retrieval · Retrieval · Natural Language Queries · Natural Language Moment Retrieval
Paper · PDF · Code · Code (official)

Abstract

Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short-video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for detecting specific moments. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to closely mimic the long-video setting during training. RGNet surpasses prior methods, showcasing state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.
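The contrastive clip sampling idea described above can be sketched in a few lines: pair the clip containing the ground-truth moment with negative clips drawn from the same long video, so that training mimics retrieval over a long video. This is an illustrative sketch only; the function name, parameters, and sampling details are assumptions, not the paper's actual implementation.

```python
import random

def sample_contrastive_clips(num_clips, gt_clip_idx, num_negatives=3, seed=None):
    """Sample one positive clip (the one containing the ground-truth
    moment) plus negative clips from the same long video.

    Illustrative sketch; the paper's actual sampling procedure may
    differ (e.g., in how hard negatives are chosen).
    """
    rng = random.Random(seed)
    # All other clips in the video are candidate negatives.
    candidates = [i for i in range(num_clips) if i != gt_clip_idx]
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    # Positive first, then negatives: a mini retrieval problem per query.
    return [gt_clip_idx] + negatives
```

During training, a retrieval loss over such a sampled set pushes the model to rank the positive clip above same-video negatives, approximating the long-video retrieval setting without encoding the entire video.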

Results

Task                      | Dataset | Metric                  | Value | Model
Video                     | MAD     | R@1, IoU=0.1            | 12.43 | RGNet
Video                     | MAD     | R@1, IoU=0.3            |  9.48 | RGNet
Video                     | MAD     | R@1, IoU=0.5            |  5.61 | RGNet
Video                     | MAD     | R@5, IoU=0.1            | 25.12 | RGNet
Video                     | MAD     | R@5, IoU=0.3            | 18.72 | RGNet
Video                     | MAD     | R@5, IoU=0.5            | 10.86 | RGNet
Natural Language Queries  | Ego4D   | R@1, IoU=0.3            | 20.63 | RGNet
Natural Language Queries  | Ego4D   | R@1, IoU=0.5            | 12.47 | RGNet
Natural Language Queries  | Ego4D   | R@1, Mean (0.3 and 0.5) | 16.55 | RGNet
Natural Language Queries  | Ego4D   | R@5, IoU=0.3            | 41.67 | RGNet
Natural Language Queries  | Ego4D   | R@5, IoU=0.5            | 25.08 | RGNet
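The R@k, IoU=t metrics in the table count a query as correct when at least one of the model's top-k predicted moments overlaps the ground-truth moment with temporal IoU at or above the threshold t. A minimal sketch of that computation (not the official evaluation code for MAD or Ego4D):

```python
def temporal_iou(pred, gt):
    """IoU between two temporal intervals given as (start, end) pairs."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, gts, k, iou_threshold):
    """Fraction of queries whose top-k predicted moments include at
    least one with IoU >= iou_threshold against the ground truth.

    predictions: list (one entry per query) of ranked (start, end) moments.
    gts: list of ground-truth (start, end) moments, one per query.
    """
    hits = 0
    for preds, gt in zip(predictions, gts):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return hits / len(gts)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) has IoU 5/15 ≈ 0.33, so it counts as a hit at threshold 0.3 but not at 0.5.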

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
- Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)