

Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang

2025-01-12 | Visual Grounding | Referring Expression | Referring Expression Comprehension | Referring Expression Segmentation | Semantic Segmentation | Image Segmentation

Abstract

Multi-task visual grounding performs localization and segmentation simultaneously in images based on textual expressions. Most advanced methods focus on transformer-based multimodal fusion to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) makes joint prediction error-prone and leads to inconsistencies between the two tasks' predictions. In addition, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture ($\text{C}^3\text{VG}$), which integrates implicit and explicit modeling approaches within a two-stage framework. In the first stage, termed Rough Semantic Perception (RSP), query and pixel decoders generate preliminary detection and segmentation outputs. In the second stage, termed Refined Consistency Interaction (RCI), these coarse predictions are refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks. Furthermore, to address insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of $\text{C}^3\text{VG}$, which outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and models will be available at https://github.com/Dmmm1997/C3VG.
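The abstract does not spell out the form of the bidirectional consistency constraint, so the following is only an illustrative sketch of what a box-mask agreement penalty could look like, not the paper's loss. Every name here (`mask_to_box`, `box_iou`, `outside_box_mass`, `consistency_penalty`) is hypothetical, and the two terms are assumptions: one compares the tightest box around the mask to the predicted box, the other measures how much mask probability falls outside that box.

```python
from typing import List, Tuple

# (x1, y1, x2, y2) in pixel coordinates; x2 and y2 are exclusive.
Box = Tuple[float, float, float, float]
Mask = List[List[float]]  # per-pixel foreground probability, mask[y][x]


def mask_to_box(mask: Mask, thresh: float = 0.5) -> Box:
    """Tightest box around pixels whose probability exceeds thresh."""
    ys = [y for y, row in enumerate(mask) if any(p > thresh for p in row)]
    xs = [x for row in mask for x, p in enumerate(row) if p > thresh]
    if not ys:
        return (0.0, 0.0, 0.0, 0.0)  # empty mask: degenerate box
    return (float(min(xs)), float(min(ys)),
            float(max(xs) + 1), float(max(ys) + 1))


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def outside_box_mass(mask: Mask, box: Box) -> float:
    """Fraction of the mask's probability mass lying outside the box."""
    total = sum(sum(row) for row in mask)
    if total == 0:
        return 0.0
    x1, y1, x2, y2 = box
    inside = sum(p for y, row in enumerate(mask)
                 for x, p in enumerate(row)
                 if x1 <= x < x2 and y1 <= y < y2)
    return 1.0 - inside / total


def consistency_penalty(mask: Mask, box: Box) -> float:
    """Assumed two-way penalty: mask-to-box term plus box-to-mask term."""
    mask_to_box_term = 1.0 - box_iou(mask_to_box(mask), box)
    box_to_mask_term = outside_box_mass(mask, box)
    return mask_to_box_term + box_to_mask_term
```

Under these assumptions the penalty is zero when the predicted box exactly encloses the mask and grows as the two predictions drift apart. A trainable version would need differentiable counterparts of both terms, which this sketch does not attempt.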

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | RefCOCO testA | Overall IoU | 83.18 | C3VG |
| Instance Segmentation | RefCOCO val | Overall IoU | 80.89 | C3VG |
| Instance Segmentation | RefCOCO testB | Overall IoU | 77.86 | C3VG |
| Instance Segmentation | RefCOCOg-test | Overall IoU | 76.39 | C3VG |
| Instance Segmentation | RefCOCO+ val | Overall IoU | 74.68 | C3VG |
| Instance Segmentation | RefCOCO+ testB | Overall IoU | 68.95 | C3VG |
| Instance Segmentation | RefCOCO+ testA | Overall IoU | 77.96 | C3VG |
| Instance Segmentation | RefCOCOg-val | Overall IoU | 74.43 | C3VG |
| Referring Expression Segmentation | RefCOCO testA | Overall IoU | 83.18 | C3VG |
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 80.89 | C3VG |
| Referring Expression Segmentation | RefCOCO testB | Overall IoU | 77.86 | C3VG |
| Referring Expression Segmentation | RefCOCOg-test | Overall IoU | 76.39 | C3VG |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 74.68 | C3VG |
| Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 68.95 | C3VG |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 77.96 | C3VG |
| Referring Expression Segmentation | RefCOCOg-val | Overall IoU | 74.43 | C3VG |
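
The page does not define the Overall IoU metric reported above. In the referring segmentation literature it is commonly computed as the total intersection over the total union accumulated across all test samples, which differs from averaging per-sample IoU. The sketch below assumes that definition and binary masks stored as nested lists; the function names `overall_iou` and `mean_iou` are illustrative and not taken from the C3VG code.

```python
from typing import List

BinaryMask = List[List[int]]  # 1 = foreground, 0 = background


def _inter_union(pred: BinaryMask, gt: BinaryMask):
    """Pixel counts of intersection and union for one sample."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def overall_iou(preds: List[BinaryMask], gts: List[BinaryMask]) -> float:
    """Cumulative intersection over cumulative union, in percent."""
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = _inter_union(pred, gt)
        total_inter += inter
        total_union += union
    return 100.0 * total_inter / total_union if total_union else 0.0


def mean_iou(preds: List[BinaryMask], gts: List[BinaryMask]) -> float:
    """Average of per-sample IoU, in percent, shown for contrast."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = _inter_union(pred, gt)
        scores.append(inter / union if union else 0.0)
    return 100.0 * sum(scores) / len(scores) if scores else 0.0
```

Because intersections and unions are summed before dividing, larger objects contribute more to Overall IoU than small ones, so the two metrics can diverge on the same predictions.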

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
- A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
- SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
- ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition (2025-07-15)
- Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)