Kaining Ying, Zhenhua Wang, Cong Bai, Pengfei Zhou
Most instance segmentation models are not end-to-end trainable because they rely on either proposal estimation (RPN) as a pre-processing step or non-maximum suppression (NMS) as a post-processing step. Here we propose a novel end-to-end instance segmentation method termed ISDA. It recasts the task as predicting a set of object masks, which are generated by conventional convolution with learned position-aware kernels and object features. These kernels and features are learned by a deformable attention network with multi-scale representation. Thanks to the set-prediction mechanism, the proposed method is NMS-free. Empirically, ISDA outperforms Mask R-CNN (a strong baseline) by 2.6 points on MS-COCO and achieves leading performance compared with recent models. Code will be available soon.
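
The mask-generation step described above, convolving features with one learned kernel per object, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `predict_masks`, the use of 1x1 kernels, a single shared feature map, the sigmoid activation, and the 0.5 threshold are all assumptions made for illustration. The sketch only shows that with one kernel per predicted instance, the number of output masks is fixed by the number of kernels, so no NMS pass is applied afterwards.

```python
import numpy as np


def predict_masks(kernels, features, threshold=0.5):
    """Hypothetical sketch: turn per-instance kernels into instance masks.

    kernels:  (N, C) array, one assumed 1x1 convolution kernel per instance.
    features: (C, H, W) array, an assumed shared mask feature map.
    Returns soft masks of shape (N, H, W) and their thresholded boolean form.
    """
    n, c = kernels.shape
    c_feat, h, w = features.shape
    if c != c_feat:
        raise ValueError("kernel and feature channel counts must match")

    # A 1x1 convolution is a matrix product over the channel dimension.
    logits = kernels @ features.reshape(c, h * w)
    logits = logits.reshape(n, h, w)

    # Sigmoid maps logits to per-pixel foreground probabilities (assumption).
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, probs > threshold
```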
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | COCO test-dev | AP50 | 62 | ISDA (ours) |
| Instance Segmentation | COCO test-dev | AP75 | 41.1 | ISDA (ours) |
| Instance Segmentation | COCO test-dev | APM | 41.2 | ISDA (ours) |
| Instance Segmentation | COCO test-dev | APS | 17 | ISDA (ours) |
| Instance Segmentation | COCO test-dev | mask AP | 38.7 | ISDA (ours) |
| Instance Segmentation | COCO test-dev | APL | 55.7 | ISDA (ResNet-50) |