Haozhe Xie, Hongxun Yao, Shangchen Zhou, Shengping Zhang, Wenxiu Sun
Recently, several Space-Time Memory based networks have shown that object cues from past frames (e.g., the video frames and their segmented object masks) are useful for segmenting objects in the current frame. However, these methods exploit the information in the memory by global-to-global matching between the current and past frames, which leads to mismatches with similar objects and high computational complexity. To address these problems, we propose a novel local-to-local matching solution for semi-supervised video object segmentation (VOS), namely Regional Memory Network (RMNet). In RMNet, a precise regional memory is constructed by memorizing the local regions where the target objects appear in the past frames. For the current query frame, the query regions are tracked and predicted based on the optical flow estimated from the previous frame. The proposed local-to-local matching effectively alleviates the ambiguity of similar objects in both memory and query frames, which allows information to be passed from the regional memory to the query region efficiently and effectively. Experimental results indicate that the proposed RMNet performs favorably against state-of-the-art methods on the DAVIS and YouTube-VOS datasets.
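To make the local-to-local matching idea concrete, the following is a minimal NumPy sketch, not the RMNet implementation. It assumes the memory region is given as a boolean mask and the query region as a bounding box; in the paper the query region is predicted from optical flow, which is not modeled here. The function names `bbox_from_mask` and `regional_read`, the dot-product similarity, and the softmax readout are illustrative assumptions.

```python
import numpy as np


def bbox_from_mask(mask, pad=0):
    """Return (y0, y1, x0, x1) enclosing the True pixels of `mask`, padded and clipped."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no object pixels")
    h, w = mask.shape
    y0 = max(int(ys.min()) - pad, 0)
    y1 = min(int(ys.max()) + pad + 1, h)
    x0 = max(int(xs.min()) - pad, 0)
    x1 = min(int(xs.max()) + pad + 1, w)
    return y0, y1, x0, x1


def regional_read(mem_key, mem_val, mem_mask, qry_key, qry_box):
    """Hypothetical local-to-local memory read.

    mem_key: (H, W, C) memory keys      mem_val: (H, W, D) memory values
    mem_mask: (H, W) bool, memory region (assumed given)
    qry_key: (H, W, C) query keys       qry_box: (y0, y1, x0, x1), query region (assumed given)
    Returns an (H, W, D) readout that is zero outside the query region.
    """
    h, w, _ = qry_key.shape
    d = mem_val.shape[-1]
    y0, y1, x0, x1 = qry_box

    # Only query positions inside the box and memory positions inside the mask are matched.
    q = qry_key[y0:y1, x0:x1].reshape(-1, qry_key.shape[-1])  # (Nq, C)
    k = mem_key[mem_mask]                                     # (Nm, C)
    v = mem_val[mem_mask]                                     # (Nm, D)

    # Softmax over memory positions (numerically stabilized).
    logits = q @ k.T                                          # (Nq, Nm)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    out = np.zeros((h, w, d), dtype=mem_val.dtype)
    out[y0:y1, x0:x1] = (weights @ v).reshape(y1 - y0, x1 - x0, d)
    return out
```

Under these assumptions the similarity matrix has size Nq × Nm (region pixels only) rather than HW × HW, and positions outside the regions cannot contribute to the match, which is the intuition behind the reduced ambiguity and cost described above.
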
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | DAVIS 2017 (val) | F-measure (Mean) | 86 | RMNet |
| Video | DAVIS 2017 (val) | J&F | 83.5 | RMNet |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 81 | RMNet |
| Video | DAVIS 2016 | F-measure (Mean) | 88.7 | RMNet |
| Video | DAVIS 2016 | J&F | 88.8 | RMNet |
| Video | DAVIS 2016 | Jaccard (Mean) | 88.9 | RMNet |
| Video | DAVIS 2017 (test-dev) | F-measure (Mean) | 78.1 | RMNet |
| Video | DAVIS 2017 (test-dev) | J&F | 75 | RMNet |
| Video | DAVIS 2017 (test-dev) | Jaccard (Mean) | 71.9 | RMNet |
| Video | DAVIS (no YouTube-VOS training) | D16 val (F) | 82.3 | RMNet |
| Video | DAVIS (no YouTube-VOS training) | D16 val (G) | 81.5 | RMNet |
| Video | DAVIS (no YouTube-VOS training) | D16 val (J) | 80.6 | RMNet |
| Video | DAVIS (no YouTube-VOS training) | D17 val (F) | 77.2 | RMNet |
| Video | DAVIS (no YouTube-VOS training) | D17 val (G) | 75 | RMNet |
| Video | DAVIS (no YouTube-VOS training) | D17 val (J) | 72.8 | RMNet |
| Video | DAVIS (no YouTube-VOS training) | FPS | 11.9 | RMNet |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 85.7 | RMNet |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 82.4 | RMNet |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 82.1 | RMNet |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 75.7 | RMNet |
| Video | YouTube-VOS 2018 | Overall | 81.5 | RMNet |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 86 | RMNet |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 83.5 | RMNet |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 81 | RMNet |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 88.7 | RMNet |
| Video Object Segmentation | DAVIS 2016 | J&F | 88.8 | RMNet |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 88.9 | RMNet |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 78.1 | RMNet |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 75 | RMNet |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 71.9 | RMNet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (F) | 82.3 | RMNet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (G) | 81.5 | RMNet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (J) | 80.6 | RMNet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 77.2 | RMNet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 75 | RMNet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 72.8 | RMNet |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 11.9 | RMNet |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 85.7 | RMNet |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 82.4 | RMNet |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82.1 | RMNet |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 75.7 | RMNet |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 81.5 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 86 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 83.5 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 81 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 88.7 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 88.8 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 88.9 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 78.1 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 75 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 71.9 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (F) | 82.3 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (G) | 81.5 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (J) | 80.6 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 77.2 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 75 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 72.8 | RMNet |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 11.9 | RMNet |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 85.7 | RMNet |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 82.4 | RMNet |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82.1 | RMNet |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 75.7 | RMNet |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 81.5 | RMNet |