Hongje Seong, Junhyuk Hyun, Euntai Kim
Semi-supervised video object segmentation (VOS) is the task of segmenting a target object throughout a video, given the ground-truth segmentation mask of that object in the first frame. Recently, space-time memory networks (STM) have received significant attention as a promising solution for semi-supervised VOS. However, an important point is overlooked when applying STM to VOS: the solution (STM) is non-local, but the problem (VOS) is predominantly local. To resolve this mismatch between STM and VOS, we propose a kernelized memory network (KMN). As in previous works, our KMN is pre-trained on static images before being trained on real videos. Unlike previous works, we use the Hide-and-Seek strategy during pre-training to obtain the best possible results in handling occlusions and extracting segment boundaries. The proposed KMN surpasses the state of the art on standard benchmarks by a significant margin (+5% on the DAVIS 2017 test-dev set). In addition, the runtime of KMN is 0.12 seconds per frame on the DAVIS 2016 validation set, and KMN requires hardly any extra computation compared with STM.
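
The abstract does not spell out how the memory read is made local, so the following is only a rough NumPy sketch of one way to do it, not the authors' implementation. It assumes that each memory position first picks its best-matching query position, and that a 2D Gaussian centred there down-weights matches far from it before the usual softmax-style normalisation. The function name `kernelized_read`, the argument layout, and the default `sigma` are all hypothetical choices made for illustration.

```python
import numpy as np


def kernelized_read(mem_key, mem_val, qry_key, height, width, sigma=7.0):
    """Locality-weighted memory read (illustrative sketch, assumptions above).

    mem_key: (N, d) memory keys, N = frames * height * width
    mem_val: (N, c) memory values
    qry_key: (height * width, d) query keys
    Returns an array of shape (height * width, c).
    """
    d = mem_key.shape[1]

    # Non-local step: similarity of every memory position to every query position.
    corr = mem_key @ qry_key.T                      # (N, HW)

    # For each memory position, the query position it matches best.
    best = corr.argmax(axis=1)                      # (N,)
    best_y, best_x = np.divmod(best, width)

    # Grid coordinates of every query position.
    ys, xs = np.divmod(np.arange(height * width), width)

    # Squared distance from each query position to each memory position's best match.
    dist2 = (ys[None, :] - best_y[:, None]) ** 2 + (xs[None, :] - best_x[:, None]) ** 2
    kernel = np.exp(-dist2 / (2.0 * sigma ** 2))    # (N, HW)

    # Softmax over memory positions, reweighted by the Gaussian kernel.
    logits = corr / np.sqrt(d)
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    weights = np.exp(logits) * kernel
    weights = weights / weights.sum(axis=0, keepdims=True)

    # Each query position reads a convex combination of memory values.
    return weights.T @ mem_val                      # (HW, c)
```

With a very large `sigma` the kernel is close to 1 everywhere and the read reduces to a plain non-local softmax read; a smaller `sigma` restricts each memory position's influence to a neighbourhood of its best match.
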
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | DAVIS 2017 (test-dev) | F-measure | 80.3 | KMN |
| Video | DAVIS 2017 (test-dev) | Jaccard | 74.1 | KMN |
| Video | DAVIS 2017 (test-dev) | Mean Jaccard & F-Measure | 77.2 | KMN |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 85.6 | KMN |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 83.3 | KMN |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 81.4 | KMN |
| Video | YouTube-VOS 2018 | Mean Jaccard & F-Measure | 81.4 | KMN |
| Video | DAVIS 2017 (val) | F-measure | 85.6 | KMN |
| Video | DAVIS 2017 (val) | Jaccard | 80.0 | KMN |
| Video | DAVIS 2017 (val) | Mean Jaccard & F-Measure | 82.8 | KMN |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 85.6 | KMN |
| Video | DAVIS 2017 (val) | J&F | 82.8 | KMN |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 80.0 | KMN |
| Video | DAVIS 2016 | F-measure (Mean) | 91.5 | KMN |
| Video | DAVIS 2016 | J&F | 90.5 | KMN |
| Video | DAVIS 2016 | Jaccard (Mean) | 89.5 | KMN |
| Video | DAVIS 2017 (test-dev) | F-measure (Mean) | 80.3 | KMN |
| Video | DAVIS 2017 (test-dev) | J&F | 77.2 | KMN |
| Video | DAVIS 2017 (test-dev) | Jaccard (Mean) | 74.1 | KMN |
| Video | DAVIS (no YouTube-VOS training) | D16 val (F) | 88.1 | KMN |
| Video | DAVIS (no YouTube-VOS training) | D16 val (G) | 87.6 | KMN |
| Video | DAVIS (no YouTube-VOS training) | D16 val (J) | 87.1 | KMN |
| Video | DAVIS (no YouTube-VOS training) | D17 val (F) | 77.8 | KMN |
| Video | DAVIS (no YouTube-VOS training) | D17 val (G) | 76.0 | KMN |
| Video | DAVIS (no YouTube-VOS training) | D17 val (J) | 74.2 | KMN |
| Video | DAVIS (no YouTube-VOS training) | FPS | 8.33 | KMN |
| Video | YouTube-VOS 2018 | Overall | 81.4 | KMN |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure | 80.3 | KMN |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard | 74.1 | KMN |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Mean Jaccard & F-Measure | 77.2 | KMN |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 85.6 | KMN |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 83.3 | KMN |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 81.4 | KMN |
| Video Object Segmentation | YouTube-VOS 2018 | Mean Jaccard & F-Measure | 81.4 | KMN |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure | 85.6 | KMN |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard | 80.0 | KMN |
| Video Object Segmentation | DAVIS 2017 (val) | Mean Jaccard & F-Measure | 82.8 | KMN |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 85.6 | KMN |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 82.8 | KMN |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 80.0 | KMN |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 91.5 | KMN |
| Video Object Segmentation | DAVIS 2016 | J&F | 90.5 | KMN |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 89.5 | KMN |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 80.3 | KMN |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 77.2 | KMN |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 74.1 | KMN |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (F) | 88.1 | KMN |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (G) | 87.6 | KMN |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (J) | 87.1 | KMN |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 77.8 | KMN |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 76.0 | KMN |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 74.2 | KMN |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 8.33 | KMN |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 81.4 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 85.6 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 82.8 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 80.0 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 91.5 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 90.5 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 89.5 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 80.3 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 77.2 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 74.1 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (F) | 88.1 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (G) | 87.6 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (J) | 87.1 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 77.8 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 76.0 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 74.2 | KMN |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 8.33 | KMN |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 85.6 | KMN |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 83.3 | KMN |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 81.4 | KMN |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 81.4 | KMN |
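
The combined DAVIS rows in the table (J&F, G, Mean Jaccard & F-Measure) are consistent with a plain average of the Jaccard and F-measure rows, and the FPS entry appears to be the reciprocal of the 0.12 seconds per frame quoted in the abstract. The short check below reproduces that arithmetic from the table values; the helper names `mean_jf` and `fps_from_runtime` are hypothetical and introduced only for illustration.

```python
def mean_jf(jaccard, f_measure):
    """Average of region similarity (J) and contour accuracy (F)."""
    return (jaccard + f_measure) / 2.0


def fps_from_runtime(seconds_per_frame):
    """Frames per second implied by a per-frame runtime."""
    return 1.0 / seconds_per_frame


# (J, F, combined score) exactly as listed in the table above.
rows = {
    "DAVIS 2016": (89.5, 91.5, 90.5),
    "DAVIS 2017 (val)": (80.0, 85.6, 82.8),
    "DAVIS 2017 (test-dev)": (74.1, 80.3, 77.2),
    "D16 val (no YouTube-VOS training)": (87.1, 88.1, 87.6),
    "D17 val (no YouTube-VOS training)": (74.2, 77.8, 76.0),
}

for name, (j, f, reported) in rows.items():
    # Allow for rounding of the reported values to one decimal place.
    assert abs(mean_jf(j, f) - reported) < 0.051, name

print(round(fps_from_runtime(0.12), 2))  # 8.33
```

The YouTube-VOS 2018 overall score cannot be checked the same way from this table alone: that benchmark averages J and F over both seen and unseen categories, and the Jaccard (Unseen) value is not listed here.
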