Hongje Seong, Seoung Wug Oh, Joon-Young Lee, Seongwon Lee, Suhyeon Lee, Euntai Kim
We present the Hierarchical Memory Matching Network (HMMN) for semi-supervised video object segmentation. Building on a recent memory-based method [33], we propose two advanced memory read modules that perform memory reading at multiple scales while exploiting temporal smoothness. First, we propose a kernel-guided memory matching module that replaces the non-local dense memory read commonly adopted in previous memory-based methods. The module imposes a temporal smoothness constraint on the memory read, leading to accurate memory retrieval. More importantly, we introduce a hierarchical memory matching scheme and propose a top-k guided memory matching module in which the memory read at a fine scale is guided by that at a coarse scale. With this module, we perform memory reads at multiple scales efficiently and leverage both high-level semantic and low-level fine-grained memory features to predict detailed object masks. Our network achieves state-of-the-art performance on the validation sets of DAVIS 2016/2017 (90.8% and 84.7%) and YouTube-VOS 2018/2019 (82.6% and 82.5%), and on the test-dev set of DAVIS 2017 (78.6%). The source code and model are available online: https://github.com/Hongje/HMMN.
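The coarse-to-fine idea behind top-k guided matching can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes a hypothetical 1-D layout in which coarse memory position `j` covers fine positions `j*ratio` to `j*ratio + ratio - 1`, dot-product similarity, and a softmax over the selected candidates. The function name `topk_guided_read` and all of its parameters are made up for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_guided_read(q_coarse, q_fine, mem_k_coarse, mem_k_fine, mem_v_fine,
                     k=2, ratio=2):
    """Read fine-scale memory only where coarse-scale matching points.

    Assumed (hypothetical) layout: coarse position j covers fine positions
    j*ratio .. j*ratio + ratio - 1, a 1-D stand-in for a 2-D feature map.
    """
    # 1) Coarse matching: score every coarse memory position.
    coarse_scores = [dot(q_coarse, key) for key in mem_k_coarse]
    # 2) Keep the k best-matching coarse positions.
    order = sorted(range(len(coarse_scores)),
                   key=lambda j: coarse_scores[j], reverse=True)
    top = sorted(order[:k])
    # 3) Expand them to the fine positions they cover.
    candidates = [j * ratio + o for j in top for o in range(ratio)]
    # 4) Fine matching restricted to the candidates, then a weighted read.
    weights = softmax([dot(q_fine, mem_k_fine[i]) for i in candidates])
    dim = len(mem_v_fine[0])
    read = [sum(w * mem_v_fine[i][d] for w, i in zip(weights, candidates))
            for d in range(dim)]
    return candidates, read


# Toy example: 3 coarse memory positions, 6 fine positions.
mem_k_coarse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
mem_k_fine = [[0.1 * i, 0.2 * i] for i in range(6)]
mem_v_fine = [[float(i)] for i in range(6)]
candidates, read = topk_guided_read([1.0, 0.5], [0.0, 0.0],
                                    mem_k_coarse, mem_k_fine, mem_v_fine)
print(candidates, read)
```

In this toy setup the fine-scale similarity is computed for 4 of the 6 fine memory positions instead of all of them, which is where the efficiency of a guided read would come from under these assumptions.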
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 87.5 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 84.7 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 81.9 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 92 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 90.8 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 89.6 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 82.5 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 78.6 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 74.7 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (F) | 90.6 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (G) | 89.4 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (J) | 88.2 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 83.1 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 80.4 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 77.7 | HMMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 10 | HMMN |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 87 | HMMN |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 84.6 | HMMN |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82.1 | HMMN |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 76.8 | HMMN |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 82.6 | HMMN |
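The aggregate scores in the table follow from the per-metric rows: on DAVIS, J&F (labelled G in the rows without YouTube-VOS training) is the mean of the Jaccard and F-measure means, and the YouTube-VOS overall score is the mean of the four seen/unseen Jaccard and F-measure values. A short check using only the numbers listed above (the helper name `mean` and the dictionary layout are illustrative choices):

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# (inputs to average, aggregate reported in the table)
reported = {
    "DAVIS 2017 val J&F": ([81.9, 87.5], 84.7),
    "DAVIS 2016 J&F": ([89.6, 92.0], 90.8),
    "DAVIS 2017 test-dev J&F": ([74.7, 82.5], 78.6),
    "D16 val G (no YouTube-VOS training)": ([88.2, 90.6], 89.4),
    "D17 val G (no YouTube-VOS training)": ([77.7, 83.1], 80.4),
    "YouTube-VOS 2018 overall": ([82.1, 76.8, 87.0, 84.6], 82.6),
}

for name, (parts, aggregate) in reported.items():
    computed = round(mean(parts), 1)
    print(f"{name}: computed {computed}, reported {aggregate}")
```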