Ziqin Wang, Jun Xu, Li Liu, Fan Zhu, Ling Shao
Although online learning (OL) techniques have boosted the performance of semi-supervised video object segmentation (VOS) methods, the high time cost of OL greatly restricts their practicality. Matching-based and propagation-based methods run faster by avoiding OL. However, their accuracy is sub-optimal because of mismatching and drifting problems. In this paper, we develop a real-time yet very accurate Ranking Attention Network (RANet) for VOS. Specifically, to integrate the insights of matching-based and propagation-based methods, we employ an encoder-decoder framework to learn pixel-level similarity and segmentation in an end-to-end manner. To better utilize the similarity maps, we propose a novel ranking attention module, which automatically ranks and selects these maps for fine-grained VOS performance. Experiments on the DAVIS-16 and DAVIS-17 datasets show that our RANet achieves the best speed-accuracy trade-off, e.g., 33 milliseconds per frame with J&F=85.5% on DAVIS-16. With OL, our RANet reaches J&F=87.1% on DAVIS-16, exceeding state-of-the-art VOS methods. The code can be found at https://github.com/Storife/RANet.
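
The idea of ranking and selecting similarity maps can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`cosine`, `similarity_maps`, `rank_and_select`) are hypothetical, features are plain lists of floats, and the ranking score is assumed to be each map's peak similarity, whereas the abstract only says that the module ranks the maps automatically. Padding with zero maps to reach a fixed count `k` is likewise an assumption made so that the output size is constant.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_maps(template_feats, frame_feats):
    """Build one similarity map per template pixel.

    maps[i][j] is the similarity between template pixel i and frame pixel j.
    """
    return [[cosine(t, f) for f in frame_feats] for t in template_feats]

def rank_and_select(maps, k):
    """Order maps by a ranking score and return exactly k of them.

    Assumption: the score is the peak value of each map. Maps beyond the
    top k are dropped; missing maps are filled with all-zero maps.
    """
    ranked = sorted(maps, key=max, reverse=True)[:k]
    width = len(maps[0]) if maps else 0
    ranked += [[0.0] * width for _ in range(k - len(ranked))]
    return ranked

# Toy example: 3 template pixels, 2 current-frame pixels, 2-D features.
template = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame = [[1.0, 0.0], [2.0, 2.0]]
maps = similarity_maps(template, frame)
selected = rank_and_select(maps, k=2)   # keeps the two best-matching maps
padded = rank_and_select(maps, k=5)     # 3 real maps followed by 2 zero maps
```

Because the number of template pixels varies from video to video, fixing the number of selected maps gives the downstream decoder an input of constant size, which is the practical reason for the select-and-pad step in this sketch.
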
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | DAVIS 2017 (val) | F-measure (Decay) | 19.7 | RANet |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 68.2 | RANet |
| Video | DAVIS 2017 (val) | F-measure (Recall) | 78.8 | RANet |
| Video | DAVIS 2017 (val) | J&F | 65.7 | RANet |
| Video | DAVIS 2017 (val) | Jaccard (Decay) | 18.6 | RANet |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 63.2 | RANet |
| Video | DAVIS 2017 (val) | Jaccard (Recall) | 73.7 | RANet |
| Video | DAVIS 2016 | F-measure (Decay) | 8.2 | RANet+ (online learning) |
| Video | DAVIS 2016 | F-measure (Mean) | 87.6 | RANet+ (online learning) |
| Video | DAVIS 2016 | F-measure (Recall) | 96.1 | RANet+ (online learning) |
| Video | DAVIS 2016 | J&F | 87.1 | RANet+ (online learning) |
| Video | DAVIS 2016 | Jaccard (Decay) | 7.4 | RANet+ (online learning) |
| Video | DAVIS 2016 | Jaccard (Mean) | 86.6 | RANet+ (online learning) |
| Video | DAVIS 2016 | Jaccard (Recall) | 97 | RANet+ (online learning) |
| Video | DAVIS 2016 | F-measure (Decay) | 5.1 | RANet |
| Video | DAVIS 2016 | F-measure (Mean) | 85.4 | RANet |
| Video | DAVIS 2016 | F-measure (Recall) | 94.9 | RANet |
| Video | DAVIS 2016 | J&F | 85.45 | RANet |
| Video | DAVIS 2016 | Jaccard (Decay) | 6.2 | RANet |
| Video | DAVIS 2016 | Jaccard (Mean) | 85.5 | RANet |
| Video | DAVIS 2016 | Jaccard (Recall) | 97.2 | RANet |
| Video | DAVIS 2017 (test-dev) | F-measure (Decay) | 22.1 | RANet |
| Video | DAVIS 2017 (test-dev) | F-measure (Mean) | 57.3 | RANet |
| Video | DAVIS 2017 (test-dev) | F-measure (Recall) | 67.7 | RANet |
| Video | DAVIS 2017 (test-dev) | J&F | 55.4 | RANet |
| Video | DAVIS 2017 (test-dev) | Jaccard (Decay) | 21.9 | RANet |
| Video | DAVIS 2017 (test-dev) | Jaccard (Mean) | 53.4 | RANet |
| Video | DAVIS 2017 (test-dev) | Jaccard (Recall) | 61.9 | RANet |
| Video | DAVIS (no YouTube-VOS training) | D16 val (F) | 85.4 | RANet |
| Video | DAVIS (no YouTube-VOS training) | D16 val (G) | 85.5 | RANet |
| Video | DAVIS (no YouTube-VOS training) | D16 val (J) | 85.5 | RANet |
| Video | DAVIS (no YouTube-VOS training) | D17 test (F) | 57.2 | RANet |
| Video | DAVIS (no YouTube-VOS training) | D17 test (G) | 55.3 | RANet |
| Video | DAVIS (no YouTube-VOS training) | D17 test (J) | 53.4 | RANet |
| Video | DAVIS (no YouTube-VOS training) | D17 val (F) | 68.2 | RANet |
| Video | DAVIS (no YouTube-VOS training) | D17 val (G) | 65.7 | RANet |
| Video | DAVIS (no YouTube-VOS training) | D17 val (J) | 63.2 | RANet |
| Video | DAVIS (no YouTube-VOS training) | FPS | 30.3 | RANet |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Decay) | 19.7 | RANet |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 68.2 | RANet |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Recall) | 78.8 | RANet |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 65.7 | RANet |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Decay) | 18.6 | RANet |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 63.2 | RANet |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Recall) | 73.7 | RANet |
| Video Object Segmentation | DAVIS 2016 | F-measure (Decay) | 8.2 | RANet+ (online learning) |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 87.6 | RANet+ (online learning) |
| Video Object Segmentation | DAVIS 2016 | F-measure (Recall) | 96.1 | RANet+ (online learning) |
| Video Object Segmentation | DAVIS 2016 | J&F | 87.1 | RANet+ (online learning) |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Decay) | 7.4 | RANet+ (online learning) |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 86.6 | RANet+ (online learning) |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Recall) | 97 | RANet+ (online learning) |
| Video Object Segmentation | DAVIS 2016 | F-measure (Decay) | 5.1 | RANet |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 85.4 | RANet |
| Video Object Segmentation | DAVIS 2016 | F-measure (Recall) | 94.9 | RANet |
| Video Object Segmentation | DAVIS 2016 | J&F | 85.45 | RANet |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Decay) | 6.2 | RANet |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 85.5 | RANet |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Recall) | 97.2 | RANet |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Decay) | 22.1 | RANet |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 57.3 | RANet |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Recall) | 67.7 | RANet |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 55.4 | RANet |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Decay) | 21.9 | RANet |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 53.4 | RANet |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Recall) | 61.9 | RANet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (F) | 85.4 | RANet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (G) | 85.5 | RANet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (J) | 85.5 | RANet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 test (F) | 57.2 | RANet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 test (G) | 55.3 | RANet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 test (J) | 53.4 | RANet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 68.2 | RANet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 65.7 | RANet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 63.2 | RANet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 30.3 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Decay) | 19.7 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 68.2 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Recall) | 78.8 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 65.7 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Decay) | 18.6 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 63.2 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Recall) | 73.7 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Decay) | 8.2 | RANet+ (online learning) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 87.6 | RANet+ (online learning) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Recall) | 96.1 | RANet+ (online learning) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 87.1 | RANet+ (online learning) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Decay) | 7.4 | RANet+ (online learning) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 86.6 | RANet+ (online learning) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Recall) | 97 | RANet+ (online learning) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Decay) | 5.1 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 85.4 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Recall) | 94.9 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 85.45 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Decay) | 6.2 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 85.5 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Recall) | 97.2 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Decay) | 22.1 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 57.3 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Recall) | 67.7 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 55.4 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Decay) | 21.9 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 53.4 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Recall) | 61.9 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (F) | 85.4 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (G) | 85.5 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (J) | 85.5 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 test (F) | 57.2 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 test (G) | 55.3 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 test (J) | 53.4 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 68.2 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 65.7 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 63.2 | RANet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 30.3 | RANet |
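
As a quick consistency check on the table, the J&F score on DAVIS is the average of the Jaccard mean (J) and the F-measure mean (F), and frames per second follow from the per-frame latency quoted in the abstract. The helper names below are hypothetical; the input numbers are taken from the rows above.

```python
def j_and_f(j_mean, f_mean):
    """J&F is the arithmetic mean of region similarity J and contour accuracy F."""
    return (j_mean + f_mean) / 2

def fps_from_ms(ms_per_frame):
    """Convert per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame

# DAVIS 2016, RANet: J = 85.5, F = 85.4 (table lists J&F = 85.45)
davis16 = round(j_and_f(85.5, 85.4), 2)
# DAVIS 2016, RANet+ with online learning: J = 86.6, F = 87.6 (table lists 87.1)
davis16_ol = round(j_and_f(86.6, 87.6), 2)
# DAVIS 2017 val, RANet: J = 63.2, F = 68.2 (table lists 65.7)
davis17_val = round(j_and_f(63.2, 68.2), 2)
# 33 ms per frame from the abstract (table lists FPS = 30.3)
fps = round(fps_from_ms(33), 1)

print(davis16, davis16_ol, davis17_val, fps)
```
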