Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, R. Manmatha
In this work, instead of directly predicting pixel-level segmentation masks, we formulate referring image segmentation as sequential polygon generation; the predicted polygons can later be converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input and autoregressively outputs a sequence of polygon vertices. For more accurate geometric localization, we propose a regression-based decoder that predicts precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving a competitive 61.5% J&F on the Ref-DAVIS17 dataset.
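To make the two geometric ideas in the abstract concrete, the sketch below shows (a) one simple way a predicted polygon could be converted into a binary mask and (b) the rounding error that appears when a coordinate is snapped to a fixed number of bins instead of being regressed as a float. This is an illustration only, not the PolyFormer implementation: the function names `point_in_polygon`, `polygon_to_mask`, and `quantize`, the pixel-center sampling rule, and the bin-center quantization scheme are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel units


def point_in_polygon(x: float, y: float, polygon: List[Point]) -> bool:
    """Even-odd ray-casting test: count edge crossings to the right of (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]  # wrap around to close the polygon
        # The edge matters only if it straddles the horizontal line through y.
        if (y0 > y) != (y1 > y):
            # x position where the edge crosses that horizontal line.
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_to_mask(polygon: List[Point], height: int, width: int) -> List[List[int]]:
    """Rasterize a polygon by testing every pixel center (col + 0.5, row + 0.5)."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, polygon) else 0
         for col in range(width)]
        for row in range(height)
    ]


def quantize(coord: float, size: float, num_bins: int) -> float:
    """Snap a coordinate in [0, size] to the center of one of num_bins equal bins.

    Hypothetical quantization scheme, used only to show the error that a
    classification-over-bins decoder would incur.
    """
    bin_width = size / num_bins
    index = min(int(coord / bin_width), num_bins - 1)
    return (index + 0.5) * bin_width


if __name__ == "__main__":
    # An axis-aligned rectangle given as float vertices.
    rectangle = [(1.0, 1.0), (5.0, 1.0), (5.0, 4.0), (1.0, 4.0)]
    for row in polygon_to_mask(rectangle, height=6, width=6):
        print("".join(str(v) for v in row))

    # Quantization error is bounded by half a bin width; regression has no such floor.
    x = 3.3
    print("quantized:", quantize(x, size=8.0, num_bins=4), "original:", x)
```

In practice a library rasterizer would normally replace the per-pixel loop; the pure-Python version is used here only to keep the example dependency-free.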
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | RefCOCO val | Mean IoU | 76.94 | PolyFormer-L |
| Instance Segmentation | RefCOCO val | Overall IoU | 75.96 | PolyFormer-L |
| Instance Segmentation | RefCOCO val | Overall IoU | 74.82 | PolyFormer-B |
| Instance Segmentation | RefCOCOg-test | Mean IoU | 71.17 | PolyFormer-L |
| Instance Segmentation | RefCOCOg-test | Overall IoU | 70.19 | PolyFormer-L |
| Instance Segmentation | RefCOCOg-test | Mean IoU | 69.88 | PolyFormer-B |
| Instance Segmentation | RefCOCOg-test | Overall IoU | 69.05 | PolyFormer-B |
| Instance Segmentation | RefCOCO+ val | Mean IoU | 72.15 | PolyFormer-L |
| Instance Segmentation | RefCOCO+ val | Overall IoU | 69.33 | PolyFormer-L |
| Instance Segmentation | RefCOCO+ val | Mean IoU | 70.65 | PolyFormer-B |
| Instance Segmentation | RefCOCO+ val | Overall IoU | 67.64 | PolyFormer-B |
| Instance Segmentation | RefCOCO+ testB | Mean IoU | 66.73 | PolyFormer-L |
| Instance Segmentation | RefCOCO+ testB | Overall IoU | 61.87 | PolyFormer-L |
| Instance Segmentation | RefCOCO+ testB | Mean IoU | 64.64 | PolyFormer-B |
| Instance Segmentation | RefCOCO+ testB | Overall IoU | 59.33 | PolyFormer-B |
| Instance Segmentation | DAVIS 2017 (val) | J&F 1st frame | 60.9 | PolyFormer-B |
| Instance Segmentation | RefCOCO+ testA | Mean IoU | 75.71 | PolyFormer-L |
| Instance Segmentation | RefCOCO+ testA | Overall IoU | 74.56 | PolyFormer-L |
| Instance Segmentation | RefCOCO+ testA | Mean IoU | 74.51 | PolyFormer-B |
| Instance Segmentation | RefCOCO+ testA | Overall IoU | 72.89 | PolyFormer-B |
| Instance Segmentation | ReferIt | Mean IoU | 67.22 | PolyFormer-L |
| Instance Segmentation | ReferIt | Overall IoU | 72.6 | PolyFormer-L |
| Instance Segmentation | ReferIt | Mean IoU | 65.98 | PolyFormer-B |
| Instance Segmentation | ReferIt | Overall IoU | 71.91 | PolyFormer-B |
| Instance Segmentation | RefCOCOg-val | Mean IoU | 71.15 | PolyFormer-L |
| Instance Segmentation | RefCOCOg-val | Overall IoU | 69.2 | PolyFormer-L |
| Instance Segmentation | RefCOCOg-val | Mean IoU | 69.36 | PolyFormer-B |
| Instance Segmentation | RefCOCOg-val | Overall IoU | 67.76 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCO val | Mean IoU | 76.94 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 75.96 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 74.82 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCOg-test | Mean IoU | 71.17 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCOg-test | Overall IoU | 70.19 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCOg-test | Mean IoU | 69.88 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCOg-test | Overall IoU | 69.05 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCO+ val | Mean IoU | 72.15 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 69.33 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCO+ val | Mean IoU | 70.65 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 67.64 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCO+ testB | Mean IoU | 66.73 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 61.87 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCO+ testB | Mean IoU | 64.64 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 59.33 | PolyFormer-B |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 60.9 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCO+ testA | Mean IoU | 75.71 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 74.56 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCO+ testA | Mean IoU | 74.51 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 72.89 | PolyFormer-B |
| Referring Expression Segmentation | ReferIt | Mean IoU | 67.22 | PolyFormer-L |
| Referring Expression Segmentation | ReferIt | Overall IoU | 72.6 | PolyFormer-L |
| Referring Expression Segmentation | ReferIt | Mean IoU | 65.98 | PolyFormer-B |
| Referring Expression Segmentation | ReferIt | Overall IoU | 71.91 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCOg-val | Mean IoU | 71.15 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCOg-val | Overall IoU | 69.2 | PolyFormer-L |
| Referring Expression Segmentation | RefCOCOg-val | Mean IoU | 69.36 | PolyFormer-B |
| Referring Expression Segmentation | RefCOCOg-val | Overall IoU | 67.76 | PolyFormer-B |
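
The table reports two IoU variants that can diverge noticeably on the same predictions. As commonly defined for referring segmentation, Mean IoU averages the per-sample IoU, while Overall IoU divides the total intersection by the total union accumulated over all samples, so it weights large objects more heavily. The sketch below illustrates the difference; the function names and the convention of scoring an empty-vs-empty pair as 1.0 are assumptions for this example, not taken from the PolyFormer evaluation code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def mean_iou(preds: List[Mask], gts: List[Mask]) -> float:
    """Average of per-sample IoU: every sample counts equally."""
    ious = []
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        # Assumed convention: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def overall_iou(preds: List[Mask], gts: List[Mask]) -> float:
    """Total intersection over total union: large objects dominate."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


if __name__ == "__main__":
    # Sample A: small object, half missed. Sample B: larger object, perfect.
    preds = [[[1, 0]], [[1, 1, 1, 1]]]
    gts = [[[1, 1]], [[1, 1, 1, 1]]]
    print(f"Mean IoU:    {100 * mean_iou(preds, gts):.2f}")
    print(f"Overall IoU: {100 * overall_iou(preds, gts):.2f}")
```

In this toy case the per-sample IoUs are 0.5 and 1.0, giving a Mean IoU of 75.00, while the pooled counts (5 intersecting pixels over 6 in the union) give an Overall IoU of about 83.33.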