TubeDETR: Spatio-Temporal Video Grounding with Transformers

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

2022-03-30 · CVPR 2022

Tasks: Language-Based Temporal Localization, Visual Grounding, Video Grounding, Spatio-Temporal Video Grounding, Natural Language Visual Grounding, Person-centric Visual Grounding, Temporal Localization, Object Detection

Abstract

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and trained models are publicly available at https://antoyang.github.io/tubedetr.html.

Results

Task                             Dataset   Metric                   Value  Model
Spatio-Temporal Video Grounding  VidSTG    Declarative m_vIoU       30.4   TubeDETR
Spatio-Temporal Video Grounding  VidSTG    Declarative vIoU@0.3     42.5   TubeDETR
Spatio-Temporal Video Grounding  VidSTG    Declarative vIoU@0.5     28.2   TubeDETR
Spatio-Temporal Video Grounding  VidSTG    Interrogative m_vIoU     25.7   TubeDETR
Spatio-Temporal Video Grounding  VidSTG    Interrogative vIoU@0.3   35.7   TubeDETR
Spatio-Temporal Video Grounding  VidSTG    Interrogative vIoU@0.5   23.2   TubeDETR
Spatio-Temporal Video Grounding  HC-STVG1  m_vIoU                   32.4   TubeDETR
Spatio-Temporal Video Grounding  HC-STVG1  vIoU@0.3                 49.8   TubeDETR
Spatio-Temporal Video Grounding  HC-STVG1  vIoU@0.5                 23.5   TubeDETR
Spatio-Temporal Video Grounding  HC-STVG2  Val m_vIoU               36.4   TubeDETR
Spatio-Temporal Video Grounding  HC-STVG2  Val vIoU@0.3             58.8   TubeDETR
Spatio-Temporal Video Grounding  HC-STVG2  Val vIoU@0.5             30.6   TubeDETR
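
The m_vIoU and vIoU@R metrics reported above are not defined on this page. The sketch below follows the definition commonly used for spatio-temporal video grounding: per-sample vIoU is the sum of per-frame box IoUs over frames where both the predicted and ground-truth tubes are present, divided by the number of frames where either is present; m_vIoU is the mean over samples, and vIoU@R is the fraction of samples with vIoU above R. The function names (box_iou, viou, summarize) and the dict-of-boxes tube representation are illustrative assumptions, not the official TubeDETR evaluation code. Values here are fractions in [0, 1], whereas the table reports percentages.

```python
from typing import Dict, Iterable, List, Tuple

# A box is (x1, y1, x2, y2); a tube maps frame index -> box (hypothetical representation).
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred: Tube, gt: Tube) -> float:
    """Sum of per-frame IoUs over shared frames, normalized by the temporal union."""
    frames_union = set(pred) | set(gt)
    if not frames_union:
        return 0.0
    frames_inter = set(pred) & set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in frames_inter) / len(frames_union)


def summarize(vious: List[float], thresholds: Iterable[float] = (0.3, 0.5)) -> Dict[str, float]:
    """Mean vIoU plus the fraction of samples whose vIoU exceeds each threshold."""
    n = len(vious)
    scores = {"m_vIoU": sum(vious) / n}
    for r in thresholds:
        scores[f"vIoU@{r}"] = sum(v > r for v in vious) / n
    return scores


if __name__ == "__main__":
    # Ground truth spans frames 0-3, prediction spans frames 2-5, boxes identical.
    box = (0.0, 0.0, 10.0, 10.0)
    gt = {t: box for t in range(0, 4)}
    pred = {t: box for t in range(2, 6)}
    print(viou(pred, gt))
    print(summarize([0.2, 0.4, 0.6]))
```

In the usage example, the boxes match perfectly on the two shared frames, but the temporal union covers six frames, so the tube is penalized for poor temporal localization even though its spatial localization is exact.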
