Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Context-Guided Spatio-Temporal Video Grounding

Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, Libo Zhang

2024-01-03 · CVPR 2024 · Video Grounding · Spatio-Temporal Video Grounding

Paper · PDF · Code (official)

Abstract

The spatio-temporal video grounding (STVG) task aims at locating a spatio-temporal tube for a specific instance given a text query. Despite advancements, current methods are easily affected by distractors or heavy object appearance variations in videos, because the text query alone provides insufficient object information, leading to degradation. Addressing this, we propose a novel framework, context-guided STVG (CG-STVG), which mines discriminative instance context for the object in videos and applies it as supplementary guidance for target localization. The key to CG-STVG lies in two specially designed modules: instance context generation (ICG), which focuses on discovering visual context information (in both appearance and motion) of the instance, and instance context refinement (ICR), which aims to improve the instance context from ICG by eliminating irrelevant or even harmful information from the context. During grounding, ICG and ICR are deployed at each decoding stage of a Transformer architecture for instance context learning. In particular, the instance context learned at one decoding stage is fed to the next stage and leveraged as guidance containing rich and discriminative object features to enhance the target-awareness of the decoding features, which in turn benefits the generation of better instance context and ultimately improves localization. Compared to existing methods, CG-STVG enjoys both the object information in the text query and the guidance from mined instance visual context for more accurate target localization. In our experiments on three benchmarks, including HCSTVG-v1/-v2 and VidSTG, CG-STVG sets new state-of-the-art results in m_tIoU and m_vIoU on all of them, showing its efficacy. The code will be released at https://github.com/HengLan/CGSTVG.
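The decoding loop described in the abstract — mine a context at each stage, refine it, and feed it forward as guidance — can be sketched as below. This is a hedged toy illustration only: the function names `instance_context_generation`, `instance_context_refinement`, and `decode_with_context` are hypothetical stand-ins, and the similarity pooling and agreement-based filtering are assumptions, not the paper's actual ICG/ICR designs.

```python
import numpy as np

def instance_context_generation(decoder_feat, video_feat):
    # Hypothetical ICG stand-in: pool per-frame video features, weighted by
    # their similarity to the current decoder (query) feature, to form an
    # instance context vector.
    scores = video_feat @ decoder_feat          # (T,) similarity per frame
    weights = np.exp(scores - scores.max())     # stable softmax
    weights /= weights.sum()
    return weights @ video_feat                 # (D,) pooled context

def instance_context_refinement(context, decoder_feat, keep_ratio=0.5):
    # Hypothetical ICR stand-in: zero out context dimensions that disagree
    # with the query feature, a toy proxy for removing irrelevant or
    # harmful information from the mined context.
    agreement = context * decoder_feat
    thresh = np.quantile(agreement, 1.0 - keep_ratio)
    return np.where(agreement >= thresh, context, 0.0)

def decode_with_context(video_feat, query_feat, num_stages=3):
    # Each stage mines and refines a context, then injects it into the
    # decoding feature passed to the next stage, mirroring the cross-stage
    # guidance loop the abstract describes.
    feat = query_feat
    for _ in range(num_stages):
        ctx = instance_context_generation(feat, video_feat)
        ctx = instance_context_refinement(ctx, feat)
        feat = feat + ctx                       # guidance injection
    return feat
```

In the real model these operations act on multi-head attention features inside a Transformer decoder; the sketch only shows the data flow between stages.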

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Spatio-Temporal Video Grounding | VidSTG | Declarative m_vIoU | 34 | CG-STVG
Spatio-Temporal Video Grounding | VidSTG | Declarative vIoU@0.3 | 47.7 | CG-STVG
Spatio-Temporal Video Grounding | VidSTG | Declarative vIoU@0.5 | 33.1 | CG-STVG
Spatio-Temporal Video Grounding | VidSTG | Interrogative m_vIoU | 29 | CG-STVG
Spatio-Temporal Video Grounding | VidSTG | Interrogative vIoU@0.3 | 40.5 | CG-STVG
Spatio-Temporal Video Grounding | VidSTG | Interrogative vIoU@0.5 | 27.5 | CG-STVG
Spatio-Temporal Video Grounding | HC-STVG1 | m_vIoU | 38.4 | CG-STVG
Spatio-Temporal Video Grounding | HC-STVG1 | vIoU@0.3 | 61.5 | CG-STVG
Spatio-Temporal Video Grounding | HC-STVG1 | vIoU@0.5 | 36.3 | CG-STVG
Spatio-Temporal Video Grounding | HC-STVG2 | Val m_vIoU | 39.5 | CG-STVG
Spatio-Temporal Video Grounding | HC-STVG2 | Val vIoU@0.3 | 64.5 | CG-STVG
Spatio-Temporal Video Grounding | HC-STVG2 | Val vIoU@0.5 | 36.3 | CG-STVG
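The metrics in the table can be computed as sketched below, assuming the vIoU definition commonly used in the STVG literature: per sample, box IoU is summed over the frames where prediction and ground truth temporally overlap, then divided by the size of their temporal union; m_vIoU averages this over samples, and vIoU@R is the fraction of samples exceeding threshold R. The helper names (`box_iou`, `viou`, `m_viou_and_recall`) are illustrative, not from the paper's codebase.

```python
def box_iou(a, b):
    # IoU of two boxes in (x1, y1, x2, y2) format.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def viou(pred_boxes, gt_boxes):
    # pred_boxes / gt_boxes: dicts mapping frame index -> box.
    # Sum IoU over the temporal intersection, normalize by the union.
    inter_frames = set(pred_boxes) & set(gt_boxes)
    union_frames = set(pred_boxes) | set(gt_boxes)
    if not union_frames:
        return 0.0
    total = sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in inter_frames)
    return total / len(union_frames)

def m_viou_and_recall(samples, thresholds=(0.3, 0.5)):
    # samples: list of (pred_boxes, gt_boxes) pairs.
    scores = [viou(p, g) for p, g in samples]
    m_viou = sum(scores) / len(scores)
    recall = {r: sum(s > r for s in scores) / len(scores) for r in thresholds}
    return m_viou, recall
```

A prediction with perfect boxes but half the ground-truth frames missed scores well below 1.0, which is why vIoU@0.5 numbers are typically much lower than vIoU@0.3.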

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency (2025-06-02)
SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models (2025-05-24)
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos (2025-05-22)
Object-Shot Enhanced Grounding Network for Egocentric Video (2025-05-07)
Enhancing Weakly Supervised Video Grounding via Diverse Inference Strategies for Boundary and Prediction Selection (2025-03-29)
VideoGEM: Training-free Action Grounding in Videos (2025-03-26)
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability (2025-03-18)