Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Joint Visual Grounding and Tracking with Natural Language Specification

Li Zhou, Zikun Zhou, Kaige Mao, Zhenyu He

2023-03-21 · CVPR 2023 · Visual Grounding · Visual Tracking
Paper · PDF · Code (official)

Abstract

Tracking by natural language specification aims to locate the referred target in a video sequence based on a natural language description. Existing algorithms solve this problem in two steps, visual grounding and tracking, and accordingly deploy a separate grounding model and tracking model to implement the two steps. Such a separated framework overlooks the link between visual grounding and tracking: the natural language description provides global semantic cues for localizing the target in both steps. Moreover, the separated framework can hardly be trained end-to-end. To address these issues, we propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task: localizing the referred target based on the given visual-language references. Specifically, we propose a multi-source relation modeling module to effectively model the relation between the visual-language references and the test image. In addition, we design a temporal modeling module that provides a temporal clue under the guidance of global semantic information, which effectively improves the model's adaptability to appearance variations of the target. Extensive experiments on TNL2K, LaSOT, OTB99, and RefCOCOg demonstrate that our method performs favorably against state-of-the-art algorithms for both tracking and grounding. Code is available at https://github.com/lizhou-cs/JointNLT.
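The unified-task idea in the abstract can be illustrated with a small control-flow sketch: one model handles both grounding (language reference only, first frame) and tracking (language plus a visual template, subsequent frames). All names and the placeholder prediction below are hypothetical; the actual JointNLT model is a transformer that fuses language tokens, the template, and the test image.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

@dataclass
class References:
    # Natural-language description of the target (used in both steps).
    language: str
    # Target box from an earlier frame; None means we are grounding.
    template: Optional[Box] = None

class JointLocalizer:
    """Toy stand-in for a unified grounding-and-tracking model.

    The real model builds relations between the visual-language
    references and the test image; here the prediction is a fixed box
    shifted per frame, so only the control flow is meaningful.
    """

    def localize(self, refs: References, frame: int) -> Box:
        # Placeholder prediction, NOT a real inference step.
        return (10.0 + frame, 20.0 + frame, 50.0, 80.0)

def track_by_language(description: str, num_frames: int) -> List[Box]:
    model = JointLocalizer()
    # Step 1 (grounding): only the language reference is available.
    box = model.localize(References(language=description), frame=0)
    results = [box]
    # Step 2 (tracking): reuse the language plus the grounded template,
    # so both steps run through the same localization interface.
    for t in range(1, num_frames):
        refs = References(language=description, template=box)
        box = model.localize(refs, frame=t)
        results.append(box)
    return results
```

The point of the sketch is that grounding and tracking differ only in which references are present, which is what lets a single model be trained end-to-end for both.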

Results

Task             Dataset  Metric     Value  Model
Visual Tracking  TNL2K    AUC        56.9   JointNLT
Visual Tracking  TNL2K    Precision  58.1   JointNLT
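For context on the AUC metric above: tracking benchmarks such as TNL2K and LaSOT typically report the area under the success curve, i.e. the fraction of frames whose predicted box overlaps the ground truth above an IoU threshold, averaged over thresholds from 0 to 1. A minimal sketch of that computation (the threshold count and the strict `>` comparison are assumptions; toolkits vary in these details):

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred: List[Box], gt: List[Box], num_thresholds: int = 21) -> float:
    """Mean success rate over evenly spaced IoU thresholds in [0, 1]."""
    ious = [iou(p, g) for p, g in zip(pred, gt)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    # Success rate at threshold t = fraction of frames with IoU > t.
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)
```

The reported precision score is a different metric (fraction of frames whose predicted center lies within a pixel distance, conventionally 20 px, of the ground-truth center) and is not computed by this sketch.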

Related Papers

- ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition (2025-07-15)
- VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation (2025-07-09)
- A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding (2025-07-09)
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning (2025-07-08)
- GTA1: GUI Test-time Scaling Agent (2025-07-08)
- What You Have is What You Track: Adaptive and Robust Multimodal Tracking (2025-07-08)
- DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World (2025-06-30)
- SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding (2025-06-27)