Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim
Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies have focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short in RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
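To make the idea of joint image and text masking concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names `mask_image_patches` and `mask_text_tokens`, the patch size, the masking ratios, the zero fill value, and the `[MASK]` token are all assumptions chosen for illustration, and the DCL step is not shown because the abstract does not specify it.

```python
import random


def mask_image_patches(image, patch_size, mask_ratio, rng, fill=0):
    """Return a copy of `image` (a list of pixel rows) with randomly chosen
    square patches overwritten by `fill`. Hypothetical helper for illustration."""
    height, width = len(image), len(image[0])
    rows, cols = height // patch_size, width // patch_size
    patches = [(r, c) for r in range(rows) for c in range(cols)]
    num_masked = int(len(patches) * mask_ratio)
    masked = [list(row) for row in image]  # leave the input untouched
    for r, c in rng.sample(patches, num_masked):
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                masked[y][x] = fill
    return masked


def mask_text_tokens(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Return a copy of `tokens` with a random subset replaced by `mask_token`.
    Hypothetical helper for illustration."""
    num_masked = int(len(tokens) * mask_ratio)
    chosen = set(rng.sample(range(len(tokens)), num_masked))
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[1] * 8 for _ in range(8)]  # toy 8x8 single-channel image
tokens = "the man in the red jacket".split()

masked_image = mask_image_patches(image, patch_size=2, mask_ratio=0.25, rng=rng)
masked_tokens = mask_text_tokens(tokens, mask_ratio=0.5, rng=rng)
```

In a training loop, a model would presumably see such masked pairs while still being supervised with the full segmentation mask, which is how masking could encourage robustness to occlusion and incomplete descriptions.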
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | RefCOCO testA | Overall IoU | 80.64 | MaskRIS (Swin-B, combined DB) |
| Instance Segmentation | RefCOCO testA | Mean IoU | 80.24 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO testA | Overall IoU | 78.96 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO val | Overall IoU | 78.71 | MaskRIS (Swin-B, combined DB) |
| Instance Segmentation | RefCOCO val | Mean IoU | 78.35 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO val | Overall IoU | 76.49 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO testB | Overall IoU | 75.1 | MaskRIS (Swin-B, combined DB) |
| Instance Segmentation | RefCOCO testB | Mean IoU | 76.06 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO testB | Overall IoU | 73.96 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCOg-test | Overall IoU | 71.09 | MaskRIS (Swin-B, combined DB) |
| Instance Segmentation | RefCOCOg-test | Mean IoU | 69.42 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCOg-test | Overall IoU | 66.5 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO+ val | Overall IoU | 70.26 | MaskRIS (Swin-B, combined DB) |
| Instance Segmentation | RefCOCO+ val | Mean IoU | 71.68 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO+ val | Overall IoU | 67.54 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO+ testB | Overall IoU | 62.83 | MaskRIS (Swin-B, combined DB) |
| Instance Segmentation | RefCOCO+ testB | Mean IoU | 64.5 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO+ testB | Overall IoU | 59.39 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO+ testA | Overall IoU | 75.15 | MaskRIS (Swin-B, combined DB) |
| Instance Segmentation | RefCOCO+ testA | Mean IoU | 76.73 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCO+ testA | Overall IoU | 74.46 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCOg-val | Overall IoU | 69.12 | MaskRIS (Swin-B, combined DB) |
| Instance Segmentation | RefCOCOg-val | Mean IoU | 69.31 | MaskRIS (Swin-B) |
| Instance Segmentation | RefCOCOg-val | Overall IoU | 65.55 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO testA | Overall IoU | 80.64 | MaskRIS (Swin-B, combined DB) |
| Referring Expression Segmentation | RefCOCO testA | Mean IoU | 80.24 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO testA | Overall IoU | 78.96 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 78.71 | MaskRIS (Swin-B, combined DB) |
| Referring Expression Segmentation | RefCOCO val | Mean IoU | 78.35 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 76.49 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO testB | Overall IoU | 75.1 | MaskRIS (Swin-B, combined DB) |
| Referring Expression Segmentation | RefCOCO testB | Mean IoU | 76.06 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO testB | Overall IoU | 73.96 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCOg-test | Overall IoU | 71.09 | MaskRIS (Swin-B, combined DB) |
| Referring Expression Segmentation | RefCOCOg-test | Mean IoU | 69.42 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCOg-test | Overall IoU | 66.5 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 70.26 | MaskRIS (Swin-B, combined DB) |
| Referring Expression Segmentation | RefCOCO+ val | Mean IoU | 71.68 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 67.54 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 62.83 | MaskRIS (Swin-B, combined DB) |
| Referring Expression Segmentation | RefCOCO+ testB | Mean IoU | 64.5 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 59.39 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 75.15 | MaskRIS (Swin-B, combined DB) |
| Referring Expression Segmentation | RefCOCO+ testA | Mean IoU | 76.73 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 74.46 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCOg-val | Overall IoU | 69.12 | MaskRIS (Swin-B, combined DB) |
| Referring Expression Segmentation | RefCOCOg-val | Mean IoU | 69.31 | MaskRIS (Swin-B) |
| Referring Expression Segmentation | RefCOCOg-val | Overall IoU | 65.55 | MaskRIS (Swin-B) |