Jiale Cao, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao
Single-stage instance segmentation approaches have recently gained popularity due to their speed and simplicity, but still lag behind two-stage methods in accuracy. We propose a fast single-stage instance segmentation method, called SipMask, that preserves instance-specific spatial information by separating the mask prediction of an instance into different sub-regions of a detected bounding box. Our main contribution is a novel lightweight spatial preservation (SP) module that generates a separate set of spatial coefficients for each sub-region within a bounding box, leading to improved mask predictions. It also enables accurate delineation of spatially adjacent instances. Further, we introduce a mask alignment weighting loss and a feature alignment scheme to better correlate mask prediction with object detection. On COCO test-dev, SipMask outperforms the existing single-stage methods. Compared to the state-of-the-art single-stage TensorMask, SipMask obtains an absolute gain of 1.0% (mask AP) while providing a four-fold speedup. In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings, while operating at comparable speed on a Titan Xp. We also evaluate SipMask for real-time video instance segmentation, achieving promising results on the YouTube-VIS dataset. The source code is available at https://github.com/JialeCao001/SipMask.
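To make the idea of per-sub-region spatial coefficients concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify how the coefficients are applied, so several details are assumptions made for illustration: the shared "basis" maps that the coefficients weight, the 2x2 split of the box, the sigmoid activation, and the function name `assemble_mask`. The repository linked above is the reference for the actual method.

```python
import math


def assemble_mask(basis, coeffs, box, grid=2):
    """Combine K basis maps into one instance mask, using a separate
    coefficient vector for each cell of a grid x grid split of the box.

    All names and shapes are illustrative assumptions:
    basis  : K maps, each an H x W list of lists of floats
    coeffs : grid*grid coefficient vectors of length K (row-major cell order)
    box    : (x0, y0, x1, y1) integer pixel bounds, end-exclusive
    """
    height, width = len(basis[0]), len(basis[0][0])
    x0, y0, x1, y1 = box
    # Pixels outside the detected box stay at zero.
    mask = [[0.0] * width for _ in range(height)]
    for y in range(y0, y1):
        # Row index of the sub-region this pixel falls into.
        gy = min((y - y0) * grid // (y1 - y0), grid - 1)
        for x in range(x0, x1):
            # Column index of the sub-region this pixel falls into.
            gx = min((x - x0) * grid // (x1 - x0), grid - 1)
            cell = coeffs[gy * grid + gx]
            # Linear combination of the basis maps with this cell's coefficients.
            logit = sum(c * b[y][x] for c, b in zip(cell, basis))
            mask[y][x] = 1.0 / (1.0 + math.exp(-logit))
    return mask


# Toy input: two constant 4x4 basis maps and four coefficient vectors,
# one per quadrant of a box covering the whole map.
basis = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
coeffs = [[5.0, 5.0], [-5.0, -5.0], [-5.0, -5.0], [5.0, 5.0]]
mask = assemble_mask(basis, coeffs, (0, 0, 4, 4))
```

In this toy input the top-left and bottom-right quadrants receive positive coefficients and the other two receive negative ones, so the same basis maps yield different mask values in different parts of the box. A single coefficient vector per instance could not produce that pattern from constant basis maps, which is the intuition behind predicting coefficients per sub-region.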
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | COCO test-dev | AP50 | 60.2 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | COCO test-dev | AP75 | 40.8 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | COCO test-dev | APL | 54.3 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | COCO test-dev | APM | 40.8 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | COCO test-dev | APS | 17.8 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | COCO test-dev | mask AP | 38.1 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | AP50 | 55.6 | SipMask++ (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | AP75 | 37.6 | SipMask++ (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | APL | 56.8 | SipMask++ (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | APM | 38.3 | SipMask++ (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | APS | 11.2 | SipMask++ (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | mask AP | 35.4 | SipMask++ (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | AP50 | 53.4 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | AP75 | 34.3 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | APL | 54.0 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | APM | 35.6 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | APS | 9.3 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | mask AP | 32.8 | SipMask (ResNet-101, single-scale test) |
| Instance Segmentation | MSCOCO | AP50 | 51.9 | SipMask (ResNet-50, single-scale test) |
| Instance Segmentation | MSCOCO | AP75 | 32.3 | SipMask (ResNet-50, single-scale test) |
| Instance Segmentation | MSCOCO | APL | 49.8 | SipMask (ResNet-50, single-scale test) |
| Instance Segmentation | MSCOCO | APM | 33.6 | SipMask (ResNet-50, single-scale test) |
| Instance Segmentation | MSCOCO | APS | 9.2 | SipMask (ResNet-50, single-scale test) |
| Instance Segmentation | MSCOCO | mask AP | 31.2 | SipMask (ResNet-50, single-scale test) |
| Video Instance Segmentation | YouTube-VIS validation | AP50 | 54.1 | SipMask (ResNet-50, ms-train, single-scale test) |
| Video Instance Segmentation | YouTube-VIS validation | AP75 | 35.8 | SipMask (ResNet-50, ms-train, single-scale test) |
| Video Instance Segmentation | YouTube-VIS validation | AR1 | 35.4 | SipMask (ResNet-50, ms-train, single-scale test) |
| Video Instance Segmentation | YouTube-VIS validation | AR10 | 40.1 | SipMask (ResNet-50, ms-train, single-scale test) |
| Video Instance Segmentation | YouTube-VIS validation | mask AP | 33.7 | SipMask (ResNet-50, ms-train, single-scale test) |
| Video Instance Segmentation | YouTube-VIS validation | AP50 | 53.0 | SipMask (ResNet-50, single-scale test) |
| Video Instance Segmentation | YouTube-VIS validation | AP75 | 33.3 | SipMask (ResNet-50, single-scale test) |
| Video Instance Segmentation | YouTube-VIS validation | AR1 | 33.5 | SipMask (ResNet-50, single-scale test) |
| Video Instance Segmentation | YouTube-VIS validation | AR10 | 38.9 | SipMask (ResNet-50, single-scale test) |
| Video Instance Segmentation | YouTube-VIS validation | mask AP | 32.5 | SipMask (ResNet-50, single-scale test) |