Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SeqTR: A Simple yet Universal Network for Visual Grounding

Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, Rongrong Ji

2022-03-30 · Visual Grounding · Referring Expression · Referring Expression Comprehension · Referring Expression Segmentation

Abstract

In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES). The canonical paradigms for visual grounding often require substantial expertise in designing network architectures and loss functions, making them hard to generalize across tasks. To simplify and unify the modeling, we cast visual grounding as a point prediction problem conditioned on image and text inputs, where either the bounding box or binary mask is represented as a sequence of discrete coordinate tokens. Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads, e.g., the convolutional mask decoder for RES, which greatly reduces the complexity of multi-task modeling. In addition, SeqTR also shares the same optimization objective for all tasks with a simple cross-entropy loss, further reducing the complexity of deploying hand-crafted loss functions. Experiments on five benchmark datasets demonstrate that the proposed SeqTR outperforms (or is on par with) the existing state-of-the-arts, proving that a simple yet universal approach for visual grounding is indeed feasible. Source code is available at https://github.com/sean-zhuh/SeqTR.
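The core idea of the abstract, representing a bounding box (or mask contour points) as a sequence of discrete coordinate tokens, can be sketched roughly as follows. This is an illustrative assumption, not the paper's exact implementation: the function names and the `num_bins=1000` quantization level are placeholders, and the real model predicts these tokens autoregressively with a cross-entropy loss.

```python
def box_to_tokens(box, image_size, num_bins=1000):
    """Quantize a box (x1, y1, x2, y2) in pixels into discrete
    coordinate tokens in [0, num_bins - 1], one token per coordinate.
    A binary mask would be handled the same way by tokenizing the
    (x, y) points sampled along its contour."""
    w, h = image_size
    sizes = (w, h, w, h)
    tokens = []
    for coord, size in zip(box, sizes):
        t = int(round(coord / size * (num_bins - 1)))
        tokens.append(min(max(t, 0), num_bins - 1))  # clamp to valid range
    return tokens

def tokens_to_box(tokens, image_size, num_bins=1000):
    """Inverse mapping: decode coordinate tokens back to pixel values."""
    w, h = image_size
    sizes = (w, h, w, h)
    return [t / (num_bins - 1) * size for t, size in zip(tokens, sizes)]
```

Because both REC boxes and RES masks reduce to such token sequences, a single decoder with one cross-entropy objective can serve all tasks, which is the unification the abstract describes.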

Results

Task                               Dataset        Metric       Value  Model
Instance Segmentation              RefCOCO testA  Overall IoU  69.79  SeqTR
Instance Segmentation              RefCOCO testB  Overall IoU  64.12  SeqTR
Referring Expression Segmentation  RefCOCO testA  Overall IoU  69.79  SeqTR
Referring Expression Segmentation  RefCOCO testB  Overall IoU  64.12  SeqTR

Related Papers

ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition (2025-07-15)
VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation (2025-07-09)
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding (2025-07-09)
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning (2025-07-08)
GTA1: GUI Test-time Scaling Agent (2025-07-08)
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy (2025-07-02)
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World (2025-06-30)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)