Sayan Nag, Koustava Goswami, Srikrishna Karanam
Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text (i.e., the referring expression). Existing methods require large-scale mask annotations. Moreover, such approaches do not generalize well to unseen/zero-shot scenarios. To address these issues, we propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations. To the best of our knowledge, ours is the first approach that considers only a fraction of both mask and box annotations (shown in Figure 1 and Table 1) for training. To enable principled training of models in such low-annotation settings, improve image-text region-level alignment, and further enhance spatial localization of the target object in the image, we propose a Cross-modal Fusion with Attention Consistency module. For automatic pseudo-labeling of unlabeled samples, we introduce a novel Mask Validity Filtering routine based on a spatially aware zero-shot proposal scoring approach. Extensive experiments show that with just 30% annotations, our model SafaRi achieves 59.31 and 48.26 mIoU on RefCOCO+ testA and RefCOCO+ testB, respectively, compared to 58.93 and 48.19 mIoU obtained by the fully-supervised SOTA method SeqTR. SafaRi also outperforms SeqTR by 11.7% (on RefCOCO+ testA) and 19.6% (on RefCOCO+ testB) in a fully-supervised setting and demonstrates strong generalization capabilities in unseen/zero-shot tasks.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 77.21 | SafaRi-B |
| Referring Expression Segmentation | RefCOCO testA | Overall IoU | 77.83 | SafaRi |
| Referring Expression Segmentation | RefCOCO testB | Overall IoU | 70.71 | SafaRi |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 70.78 | SafaRi-B |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 74.53 | SafaRi-B |
| Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 64.88 | SafaRi-B |
| Referring Expression Segmentation | RefCOCOg-val | Overall IoU | 70.48 | SafaRi-B |
| Referring Expression Segmentation | RefCOCOg-test | Overall IoU | 71.06 | SafaRi-B |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 61.3 | SafaRi-B |
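The abstract reports mIoU while the table reports Overall IoU; the two aggregate per-sample mask IoUs differently. As a minimal sketch (not the paper's evaluation code; function names are illustrative), Overall IoU accumulates intersection and union pixel counts over the whole dataset, whereas mIoU averages per-sample IoUs:

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty; convention varies
    return np.logical_and(pred, gt).sum() / union

def overall_and_mean_iou(preds, gts):
    """Overall IoU: dataset-level intersection / union.
    mIoU: mean of per-sample IoUs (weights every sample equally)."""
    inter = sum(np.logical_and(p.astype(bool), g.astype(bool)).sum()
                for p, g in zip(preds, gts))
    union = sum(np.logical_or(p.astype(bool), g.astype(bool)).sum()
                for p, g in zip(preds, gts))
    per_sample = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return inter / union, float(np.mean(per_sample))
```

Because Overall IoU is pixel-weighted, large objects dominate it; mIoU is more sensitive to performance on small objects, which is why papers often report both.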