Zhihui Lin, Tianyu Yang, Maomao Li, Ziyu Wang, Chun Yuan, Wenhao Jiang, Wei Liu
Matching-based methods, especially those based on space-time memory, are significantly ahead of other solutions in semi-supervised video object segmentation (VOS). However, their continuously growing and redundant template features make inference inefficient. To alleviate this, we propose a novel Sequential Weighted Expectation-Maximization (SWEM) network that greatly reduces the redundancy of memory features. Unlike previous methods, which only detect feature redundancy between frames, SWEM merges similar features both within and across frames by leveraging the sequential weighted EM algorithm. Further, adaptive weights for frame features give SWEM the flexibility to represent hard samples, improving the discrimination of templates. In addition, the proposed method maintains a fixed number of template features in memory, which keeps the inference complexity of the VOS system stable. Extensive experiments on the commonly used DAVIS and YouTube-VOS datasets verify the high efficiency (36 FPS) and high performance (84.3\% $\mathcal{J}\&\mathcal{F}$ on the DAVIS 2017 validation set) of SWEM. Code is available at: https://github.com/lmm077/SWEM.
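The idea of keeping a fixed number of template features while absorbing new frames can be illustrated with a minimal sketch of a sequential weighted EM update. This is not the authors' implementation: the softmax kernel with temperature `kappa`, the running statistics `num` and `den`, the helper names, and the uniform weights in the toy example are all assumptions made for illustration (in SWEM the per-feature weights are adaptive).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def e_step(features, bases, kappa):
    """Soft-assign each feature to the K bases (softmax over similarities)."""
    resp = []
    for x in features:
        logits = [kappa * dot(x, mu) for mu in bases]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        resp.append([e / total for e in exps])
    return resp

def m_step(features, weights, resp, num, den):
    """Add this frame's weighted statistics to the running ones, then
    recompute each base as a weighted mean (assumed update rule)."""
    k, d = len(num), len(num[0])
    new_num = [row[:] for row in num]
    new_den = den[:]
    for x, w, z in zip(features, weights, resp):
        for j in range(k):
            c = w * z[j]
            new_den[j] += c
            for i in range(d):
                new_num[j][i] += c * x[i]
    bases = [[new_num[j][i] / new_den[j] for i in range(d)] for j in range(k)]
    return bases, new_num, new_den

def update_memory(features, weights, bases, num, den, iters=4, kappa=5.0):
    """Run a few EM iterations for one frame; memory size stays at K bases."""
    for _ in range(iters):
        resp = e_step(features, bases, kappa)
        bases, cand_num, cand_den = m_step(features, weights, resp, num, den)
    return bases, cand_num, cand_den

# Toy example: two frames of 2-D features compressed into K = 2 bases.
frame1 = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
frame2 = [[1.0, 0.1], [0.0, 1.1]]
bases = [[1.0, 0.0], [0.0, 1.0]]          # hypothetical initialization
num, den = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
for frame in (frame1, frame2):
    weights = [1.0] * len(frame)          # uniform weights, for illustration only
    bases, num, den = update_memory(frame, weights, bases, num, den)
```

However many frames are processed, the memory in this sketch holds only `K` bases plus their running statistics, which is what keeps the matching cost per frame constant.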
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MOSE | F | 54.9 | SWEM |
| Video | MOSE | J | 46.8 | SWEM |
| Video | MOSE | J&F | 50.9 | SWEM |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 79.8 | SWEM |
| Video | DAVIS 2017 (val) | J&F | 77.2 | SWEM |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 74.5 | SWEM |
| Video | DAVIS 2016 | F-measure (Mean) | 89 | SWEM (val) |
| Video | DAVIS 2016 | J&F | 88.1 | SWEM (val) |
| Video | DAVIS 2016 | Jaccard (Mean) | 87.3 | SWEM (val) |
| Video | DAVIS 2016 | Speed (FPS) | 36 | SWEM (val) |
| Video | DAVIS (no YouTube-VOS training) | D16 val (F) | 89 | SWEM |
| Video | DAVIS (no YouTube-VOS training) | D16 val (G) | 88.1 | SWEM |
| Video | DAVIS (no YouTube-VOS training) | D16 val (J) | 87.3 | SWEM |
| Video | DAVIS (no YouTube-VOS training) | D17 val (F) | 79.8 | SWEM |
| Video | DAVIS (no YouTube-VOS training) | D17 val (G) | 77.2 | SWEM |
| Video | DAVIS (no YouTube-VOS training) | D17 val (J) | 74.5 | SWEM |
| Video | DAVIS (no YouTube-VOS training) | FPS | 36 | SWEM |
| Video Object Segmentation | MOSE | F | 54.9 | SWEM |
| Video Object Segmentation | MOSE | J | 46.8 | SWEM |
| Video Object Segmentation | MOSE | J&F | 50.9 | SWEM |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 79.8 | SWEM |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 77.2 | SWEM |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 74.5 | SWEM |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 89 | SWEM (val) |
| Video Object Segmentation | DAVIS 2016 | J&F | 88.1 | SWEM (val) |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 87.3 | SWEM (val) |
| Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 36 | SWEM (val) |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (F) | 89 | SWEM |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (G) | 88.1 | SWEM |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (J) | 87.3 | SWEM |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 79.8 | SWEM |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 77.2 | SWEM |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 74.5 | SWEM |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 36 | SWEM |
| Semi-Supervised Video Object Segmentation | MOSE | F | 54.9 | SWEM |
| Semi-Supervised Video Object Segmentation | MOSE | J | 46.8 | SWEM |
| Semi-Supervised Video Object Segmentation | MOSE | J&F | 50.9 | SWEM |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 79.8 | SWEM |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 77.2 | SWEM |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 74.5 | SWEM |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 89 | SWEM (val) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 88.1 | SWEM (val) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 87.3 | SWEM (val) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 36 | SWEM (val) |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (F) | 89 | SWEM |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (G) | 88.1 | SWEM |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D16 val (J) | 87.3 | SWEM |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 79.8 | SWEM |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 77.2 | SWEM |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 74.5 | SWEM |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 36 | SWEM |
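
In these benchmarks, J&F is conventionally the mean of the region similarity J and the contour accuracy F. The short check below, with values copied from the table above, shows that the reported J&F figures agree with that mean up to rounding. The small gaps are assumed to come from averaging unrounded scores before rounding.

```python
# (dataset, J, F, reported J&F) copied from the table above
rows = [
    ("MOSE", 46.8, 54.9, 50.9),
    ("DAVIS 2017 (val)", 74.5, 79.8, 77.2),
    ("DAVIS 2016", 87.3, 89.0, 88.1),
]

def j_and_f(j, f):
    """Mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2.0

# Absolute gap between the recomputed mean and the reported value
gaps = {name: abs(j_and_f(j, f) - reported) for name, j, f, reported in rows}
for name, gap in gaps.items():
    print(f"{name}: recomputed vs reported gap = {gap:.3f}")
```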