Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang

2025-01-12

Visual Grounding, Referring Expression, Referring Expression Comprehension, Referring Expression Segmentation, Semantic Segmentation, Image Segmentation


Abstract

Multi-task visual grounding performs localization and segmentation in images simultaneously, based on textual expressions. Most advanced methods focus on transformer-based multimodal fusion to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) makes joint prediction error-prone, leading to inconsistencies between the multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture ($\text{C}^3\text{VG}$), which integrates implicit and explicit modeling approaches within a two-stage framework. First, query and pixel decoders generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are then refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of $\text{C}^3\text{VG}$, which outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at https://github.com/Dmmm1997/C3VG.
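To make the idea of cross-task consistency concrete, the following is a minimal sketch of one way to measure disagreement between a predicted box and a predicted mask in both directions. It is not the loss used in the paper, whose exact formulation is not given in the abstract; the function names (`mask_to_box`, `box_iou`, `outside_fraction`, `consistency_penalty`), the pixel-grid box convention, and the equal weighting of the two terms are all assumptions chosen for illustration.

```python
# Illustrative only: a hand-rolled box/mask agreement score, NOT the paper's loss.
# Boxes are (x1, y1, x2, y2) on the pixel grid with exclusive x2/y2 (assumption).

def mask_to_box(mask):
    """Return the tight box around nonzero pixels, or None for an empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def box_iou(a, b):
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def outside_fraction(mask, box):
    """Fraction of mask pixels that fall outside the box."""
    total = outside = 0
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                total += 1
                if not (box[0] <= x < box[2] and box[1] <= y < box[3]):
                    outside += 1
    return outside / total if total else 0.0


def consistency_penalty(mask, box):
    """Sum of a mask-to-box term and a box-to-mask term (equal weights assumed)."""
    tight = mask_to_box(mask)
    if tight is None:
        return 1.0  # assumption: an empty mask counts as full disagreement
    mask_to_box_term = 1.0 - box_iou(tight, box)   # box should match the mask extent
    box_to_mask_term = outside_fraction(mask, box)  # mask should stay inside the box
    return mask_to_box_term + box_to_mask_term


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
agree = consistency_penalty(mask, (1, 1, 3, 3))     # box matches the mask exactly
loose = consistency_penalty(mask, (0, 0, 4, 4))     # box is too large
shifted = consistency_penalty(mask, (2, 2, 4, 4))   # box misses most of the mask
```

A differentiable training loss would operate on soft mask probabilities and continuous box coordinates rather than on binary lists; the sketch only shows the two directions in which a box and a mask can disagree.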

Results

Task	Dataset	Metric	Value	Model
Instance Segmentation	RefCOCO val	Overall IoU	80.89	C3VG
Instance Segmentation	RefCOCO testA	Overall IoU	83.18	C3VG
Instance Segmentation	RefCOCO testB	Overall IoU	77.86	C3VG
Instance Segmentation	RefCOCO+ val	Overall IoU	74.68	C3VG
Instance Segmentation	RefCOCO+ testA	Overall IoU	77.96	C3VG
Instance Segmentation	RefCOCO+ testB	Overall IoU	68.95	C3VG
Instance Segmentation	RefCOCOg val	Overall IoU	74.43	C3VG
Instance Segmentation	RefCOCOg test	Overall IoU	76.39	C3VG
Referring Expression Segmentation	RefCOCO val	Overall IoU	80.89	C3VG
Referring Expression Segmentation	RefCOCO testA	Overall IoU	83.18	C3VG
Referring Expression Segmentation	RefCOCO testB	Overall IoU	77.86	C3VG
Referring Expression Segmentation	RefCOCO+ val	Overall IoU	74.68	C3VG
Referring Expression Segmentation	RefCOCO+ testA	Overall IoU	77.96	C3VG
Referring Expression Segmentation	RefCOCO+ testB	Overall IoU	68.95	C3VG
Referring Expression Segmentation	RefCOCOg val	Overall IoU	74.43	C3VG
Referring Expression Segmentation	RefCOCOg test	Overall IoU	76.39	C3VG
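
For readers unfamiliar with the metric in the table, Overall IoU is conventionally computed in the referring segmentation literature as the total intersection divided by the total union, accumulated over all samples in a split, rather than as the mean of per-sample IoUs. The sketch below illustrates that definition on toy binary masks; the function name `overall_iou` and the nested-list mask format are assumptions for illustration, and this is not the evaluation script of the official repository.

```python
# Illustrative sketch of Overall IoU (cumulative intersection / cumulative union).
# Masks are nested lists of 0/1 values with matching shapes (assumption).

def overall_iou(preds, gts):
    """Accumulate intersection and union over every sample, then divide once."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                p, g = bool(p), bool(g)
                total_inter += p and g
                total_union += p or g
    return total_inter / total_union if total_union else 0.0


preds = [
    [[1, 1], [0, 0]],  # sample 1: intersection 1, union 2
    [[1, 0], [0, 0]],  # sample 2: intersection 1, union 4
]
gts = [
    [[1, 0], [0, 0]],
    [[1, 1], [1, 1]],
]
score = overall_iou(preds, gts)  # (1 + 1) / (2 + 4)
```

Because the sums are pooled before dividing, samples with large target regions weigh more heavily than they would under a per-sample mean; in the toy example the pooled value is 2/6, whereas the mean of the two per-sample IoUs would be 0.375.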
