Mask Grounding for Referring Image Segmentation

Yong Xien Chng, Henry Zheng, Yizeng Han, Xuchong Qiu, Gao Huang

Published: 2023-12-19
Venue: CVPR 2024
Tasks: Visual Grounding, cross-modal alignment, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Image Segmentation

Abstract

Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be applied directly to prior RIS methods and consistently brings improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. With all these techniques, our comprehensive approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior art on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.
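
The abstract describes Mask Grounding as predicting masked textual tokens from their matching visual objects, but gives no implementation details. The sketch below illustrates only the input side of such an auxiliary task: randomly hiding tokens of a referring expression and recording which tokens a model would have to recover. The names `MASK_TOKEN` and `mask_tokens`, the whitespace tokenization, and the uniform masking probability are all assumptions made for illustration, not the authors' procedure.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical placeholder symbol; the actual tokenizer and mask token
# used by MagNet are not stated in the abstract.
MASK_TOKEN = "[MASK]"


def mask_tokens(
    tokens: List[str],
    mask_prob: float,
    rng: random.Random,
) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly replace tokens with MASK_TOKEN.

    Returns the masked sequence and a parallel list of targets:
    the original token at masked positions, None elsewhere.
    """
    masked: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)  # position ignored by the auxiliary loss
    return masked, targets


# Usage: mask a referring expression with a fixed seed for repeatability.
expression = "the man in the red shirt left of the dog".split()
masked, targets = mask_tokens(expression, mask_prob=0.3, rng=random.Random(0))
print(masked)
print(targets)
```

In a full system, a prediction head would take the masked language features together with visual features and classify each masked position over the vocabulary; how MagNet does this is described in the paper itself, not in this sketch.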

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Instance Segmentation | RefCOCO testA | Overall IoU | 78.24 | MagNet |
| Instance Segmentation | RefCOCO val | Overall IoU | 75.24 | MagNet |
| Instance Segmentation | RefCOCO testB | Overall IoU | 71.05 | MagNet |
| Instance Segmentation | RefCOCOg-test | Overall IoU | 66.03 | MagNet |
| Instance Segmentation | RefCOCO+ val | Overall IoU | 66.16 | MagNet |
| Instance Segmentation | RefCOCO+ testB | Overall IoU | 58.14 | MagNet |
| Instance Segmentation | RefCOCO+ testA | Overall IoU | 71.32 | MagNet |
| Instance Segmentation | RefCOCOg-val | Overall IoU | 65.36 | MagNet |
| Referring Expression Segmentation | RefCOCO testA | Overall IoU | 78.24 | MagNet |
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 75.24 | MagNet |
| Referring Expression Segmentation | RefCOCO testB | Overall IoU | 71.05 | MagNet |
| Referring Expression Segmentation | RefCOCOg-test | Overall IoU | 66.03 | MagNet |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 66.16 | MagNet |
| Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 58.14 | MagNet |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 71.32 | MagNet |
| Referring Expression Segmentation | RefCOCOg-val | Overall IoU | 65.36 | MagNet |
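
The page does not define the Overall IoU metric. In the referring segmentation literature it is commonly computed as the total intersection over the total union accumulated across all test samples, which weights large objects more heavily than a per-sample mean. The sketch below assumes that convention and plain nested-list binary masks; the function names are illustrative and not taken from the MagNet code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # 2D binary mask, 1 = foreground


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def overall_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Cumulative intersection divided by cumulative union over all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def mean_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Average of per-sample IoU, shown only for contrast with overall_iou."""
    scores = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Two toy (prediction, ground truth) pairs.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # intersection 1, union 2
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),  # intersection 4, union 4
]
print(overall_iou(pairs))  # 5 / 6
print(mean_iou(pairs))     # (0.5 + 1.0) / 2 = 0.75
```

On the toy data the two metrics differ (5/6 versus 0.75) because the larger, perfectly segmented object dominates the cumulative counts.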

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)