Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

Sayan Nag, Koustava Goswami, Srikrishna Karanam

2024-07-02 | Tasks: Referring Expression, Referring Expression Segmentation

Abstract

Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text (i.e., the referring expression). Existing methods require large-scale mask annotations. Moreover, such approaches do not generalize well to unseen/zero-shot scenarios. To address the aforementioned issues, we propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations. To the best of our knowledge, ours is the first approach that considers only a fraction of both mask and box annotations (shown in Figure 1 and Table 1) for training. To enable principled training of models in such low-annotation settings, improve image-text region-level alignment, and further enhance spatial localization of the target object in the image, we propose a Cross-modal Fusion with Attention Consistency module. For automatic pseudo-labeling of unlabeled samples, we introduce a novel Mask Validity Filtering routine based on a spatially aware zero-shot proposal scoring approach. Extensive experiments show that with just 30% of annotations, our model SafaRi achieves 59.31 and 48.26 mIoU, compared to the 58.93 and 48.19 mIoU obtained by the fully-supervised SOTA method SeqTR, on the RefCOCO+ testA and RefCOCO+ testB splits respectively. SafaRi also outperforms SeqTR by 11.7% (on RefCOCO+ testA) and 19.6% (on RefCOCO+ testB) in a fully-supervised setting, and demonstrates strong generalization capabilities in unseen/zero-shot tasks.
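The Mask Validity Filtering routine described in the abstract can be sketched at a high level: score each pseudo-mask by its spatial agreement with region proposals and keep only samples above a threshold. The scoring function below is a simplified stand-in (mask-in-box overlap) for the paper's spatially aware zero-shot proposal scoring, and all names here are hypothetical:

```python
import numpy as np

def validity_score(pseudo_mask, box_proposals):
    """Simplified spatial score: fraction of the pseudo-mask's area that
    falls inside its best-matching box proposal (boxes as x0, y0, x1, y1)."""
    mask_area = pseudo_mask.sum()
    if mask_area == 0:
        return 0.0
    best = 0.0
    for (x0, y0, x1, y1) in box_proposals:
        inside = pseudo_mask[y0:y1, x0:x1].sum()
        best = max(best, inside / mask_area)
    return best

def filter_pseudo_labels(samples, threshold=0.5):
    """Keep only unlabeled samples whose pseudo-mask passes the score,
    so they can be fed back into training as pseudo-labels."""
    return [s for s in samples
            if validity_score(s["mask"], s["boxes"]) >= threshold]
```

The filtered subset would then be treated as labeled data in the next bootstrapping round; the paper's actual scoring is more involved than this overlap heuristic.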

Results

Task | Dataset | Metric | Value | Model
Instance Segmentation | RefCOCO testA | Overall IoU | 77.83 | SafaRi
Instance Segmentation | RefCOCO val | Overall IoU | 77.21 | SafaRi-B
Instance Segmentation | RefCOCO testB | Overall IoU | 70.71 | SafaRi
Instance Segmentation | RefCOCOg test | Overall IoU | 71.06 | SafaRi-B
Instance Segmentation | RefCOCO+ val | Overall IoU | 70.78 | SafaRi-B
Instance Segmentation | RefCOCO+ testB | Overall IoU | 64.88 | SafaRi-B
Instance Segmentation | DAVIS 2017 (val) | J&F 1st frame | 61.3 | SafaRi-B
Instance Segmentation | RefCOCO+ testA | Overall IoU | 74.53 | SafaRi-B
Instance Segmentation | RefCOCOg val | Overall IoU | 70.48 | SafaRi-B
Referring Expression Segmentation | RefCOCO testA | Overall IoU | 77.83 | SafaRi
Referring Expression Segmentation | RefCOCO val | Overall IoU | 77.21 | SafaRi-B
Referring Expression Segmentation | RefCOCO testB | Overall IoU | 70.71 | SafaRi
Referring Expression Segmentation | RefCOCOg test | Overall IoU | 71.06 | SafaRi-B
Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 70.78 | SafaRi-B
Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 64.88 | SafaRi-B
Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 61.3 | SafaRi-B
Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 74.53 | SafaRi-B
Referring Expression Segmentation | RefCOCOg val | Overall IoU | 70.48 | SafaRi-B
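The Overall IoU metric reported above is conventionally computed as the cumulative intersection over the cumulative union across all test samples (rather than averaging per-image IoUs). A minimal sketch, with function names of my own choosing:

```python
import numpy as np

def overall_iou(pred_masks, gt_masks):
    """Overall (cumulative) IoU as commonly reported for RES benchmarks:
    total pixel intersection divided by total pixel union over the dataset,
    expressed as a percentage."""
    inter, union = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        p, g = pred.astype(bool), gt.astype(bool)
        inter += np.logical_and(p, g).sum()
        union += np.logical_or(p, g).sum()
    return 100.0 * inter / union
```

Because large objects contribute more pixels, Overall IoU weights samples by object size, unlike the mIoU figures quoted in the abstract, which average per-sample IoUs.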

Related Papers

- DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy (2025-07-02)
- Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
- Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models (2025-06-26)
- Referring Expression Instance Retrieval and A Strong End-to-End Baseline (2025-06-23)
- Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation (2025-06-12)
- Synthetic Visual Genome (2025-06-09)
- From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes (2025-06-05)
- Refer to Anything with Vision-Language Prompts (2025-06-05)