
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding

Xin Gu, Yaojie Shen, Chenxi Luo, Tiejian Luo, Yan Huang, Yuewei Lin, Heng Fan, Libo Zhang

2025-02-16 · Attribute · Video Grounding · Spatio-Temporal Video Grounding · Temporal Localization

Paper · PDF · Code (official)

Abstract

Transformers have attracted increasing interest in spatio-temporal video grounding (STVG), owing to their end-to-end pipeline and promising results. Existing Transformer-based STVG approaches often leverage a set of object queries for spatial and temporal localization; these queries are initialized simply with zeros and then gradually learn target position information through iterative interactions with multimodal features. Despite their simplicity, such zero-initialized queries lack target-specific cues and therefore struggle to learn discriminative target information from interactions with multimodal features in complicated scenarios (e.g., with distractors or occlusion), resulting in degraded performance. To address this, we introduce a novel Target-Aware Transformer for STVG (TA-STVG), which adaptively generates object queries by exploring target-specific cues from the given video-text pair. The key lies in two simple yet effective modules working in a cascade: text-guided temporal sampling (TTS) and attribute-aware spatial activation (ASA). The former selects target-relevant temporal cues from the video using holistic text information, while the latter further exploits fine-grained visual attribute information of the object from these target-aware temporal cues, which is then used for object query initialization. Compared to existing methods with zero-initialized queries, the object queries in TA-STVG are generated directly from the given video-text pair and naturally carry target-specific cues, making them adaptive and better able to interact with multimodal features to learn more discriminative information. In experiments on three benchmarks, TA-STVG achieves state-of-the-art performance and significantly outperforms the baseline, validating its efficacy.
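
The abstract describes a TTS-then-ASA cascade that produces target-aware object queries instead of zero-initialized ones. The sketch below illustrates that idea in PyTorch; the tensor shapes, module choices, frame-weighting scheme, and all names are our assumptions for illustration, not the paper's implementation (see the official code for the real one).

```python
import torch
import torch.nn as nn

class TargetAwareQueryGenerator(nn.Module):
    """Illustrative sketch: generate object queries from a video-text pair."""

    def __init__(self, dim: int = 256, num_queries: int = 1):
        super().__init__()
        self.tts_score = nn.Linear(dim, 1)  # scores each frame against the text
        self.asa_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_query = nn.Linear(dim, dim * num_queries)
        self.num_queries = num_queries
        self.dim = dim

    def forward(self, frame_feats: torch.Tensor, text_feat: torch.Tensor):
        # frame_feats: (B, T, D) per-frame visual features
        # text_feat:   (B, D)    holistic sentence embedding
        # TTS: weight frames by their relevance to the holistic text.
        relevance = self.tts_score(frame_feats * text_feat.unsqueeze(1))  # (B, T, 1)
        weights = relevance.softmax(dim=1)
        temporal_cue = (weights * frame_feats).sum(dim=1, keepdim=True)   # (B, 1, D)
        # ASA: attend back over frames to pick up fine-grained attribute cues.
        attr_cue, _ = self.asa_attn(temporal_cue, frame_feats, frame_feats)
        # Initialize object queries from the target-aware cue instead of zeros.
        queries = self.to_query(attr_cue.squeeze(1))                      # (B, Nq*D)
        return queries.view(-1, self.num_queries, self.dim)
```

The resulting queries would then be fed to the Transformer decoder in place of the usual zero-initialized embeddings, so every downstream interaction with multimodal features starts from target-specific evidence.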

Results

Task | Dataset | Metric | Value | Model
Spatio-Temporal Video Grounding | VidSTG | Declarative m_vIoU | 34.4 | TA-STVG
Spatio-Temporal Video Grounding | VidSTG | Declarative vIoU@0.3 | 48.2 | TA-STVG
Spatio-Temporal Video Grounding | VidSTG | Declarative vIoU@0.5 | 33.5 | TA-STVG
Spatio-Temporal Video Grounding | VidSTG | Interrogative m_vIoU | 29.5 | TA-STVG
Spatio-Temporal Video Grounding | VidSTG | Interrogative vIoU@0.3 | 41.5 | TA-STVG
Spatio-Temporal Video Grounding | VidSTG | Interrogative vIoU@0.5 | 28.0 | TA-STVG
Spatio-Temporal Video Grounding | HC-STVG1 | m_vIoU | 39.1 | TA-STVG
Spatio-Temporal Video Grounding | HC-STVG1 | vIoU@0.3 | 63.1 | TA-STVG
Spatio-Temporal Video Grounding | HC-STVG1 | vIoU@0.5 | 36.8 | TA-STVG
Spatio-Temporal Video Grounding | HC-STVG2 | Val m_vIoU | 40.2 | TA-STVG
Spatio-Temporal Video Grounding | HC-STVG2 | Val vIoU@0.3 | 65.8 | TA-STVG
Spatio-Temporal Video Grounding | HC-STVG2 | Val vIoU@0.5 | 36.7 | TA-STVG
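
The table reports m_vIoU and vIoU@R. In STVG evaluation, vIoU is commonly defined as the sum of per-frame box IoUs over the temporal intersection of the predicted and ground-truth tubes, normalized by their temporal union; m_vIoU averages vIoU over samples, and vIoU@R is the fraction of samples with vIoU above R. The sketch below computes these under that assumed definition; all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def box_iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """vIoU for one sample: pred/gt map frame index -> box for each tube."""
    inter_frames = pred.keys() & gt.keys()   # temporal intersection
    union_frames = pred.keys() | gt.keys()   # temporal union
    if not union_frames:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in inter_frames) / len(union_frames)

def summarize(vious: List[float], thresholds=(0.3, 0.5)) -> Dict[str, float]:
    """m_vIoU and vIoU@R over a list of per-sample vIoU scores."""
    out = {"m_vIoU": sum(vious) / len(vious)}
    for r in thresholds:
        out[f"vIoU@{r}"] = sum(v > r for v in vious) / len(vious)
    return out
```

Under this definition, a prediction is penalized both for poor per-frame localization and for predicting a temporal segment that misses or overshoots the ground-truth extent.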

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Non-Adaptive Adversarial Face Generation (2025-07-16)
Attributes Shape the Embedding Space of Face Recognition Models (2025-07-15)
COLIBRI Fuzzy Model: Color Linguistic-Based Representation and Interpretation (2025-07-15)
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (2025-07-13)
Model Parallelism With Subnetwork Data Parallelism (2025-07-11)
Bradley-Terry and Multi-Objective Reward Modeling Are Complementary (2025-07-10)