Yongqing Liang, Xin Li, Navid Jafari, Qin Chen
We propose a new matching-based framework for semi-supervised video object segmentation (VOS). Recently, state-of-the-art VOS performance has been achieved by matching-based algorithms, which create feature banks to store features for region matching and classification. However, how to effectively organize information in a continuously growing feature bank remains under-explored, which leads to inefficient bank designs. We introduce an adaptive feature bank update scheme that dynamically absorbs new features and discards obsolete ones. We also design a new confidence loss and a fine-grained segmentation module to improve segmentation accuracy in uncertain regions. On public benchmarks, our algorithm outperforms existing state-of-the-art methods.
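To make the "absorb new features and discard obsolete features" idea concrete, here is a minimal sketch of one way such a bank could behave. The abstract does not specify the update rules, so everything here is an assumption for illustration: the `FeatureBank` class, the cosine-similarity merge threshold, the weighted-average merge, and the hit-count eviction are hypothetical choices, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FeatureBank:
    """Toy feature bank (hypothetical): merge similar features, evict rarely matched ones."""

    def __init__(self, capacity=4, merge_threshold=0.95, merge_weight=0.9):
        self.capacity = capacity                # assumed fixed size budget
        self.merge_threshold = merge_threshold  # assumed similarity cutoff for merging
        self.merge_weight = merge_weight        # assumed weight kept by the stored feature
        self.features = []                      # stored feature vectors
        self.hits = []                          # how often each entry was stored or matched

    def absorb(self, feature):
        """Merge `feature` into its closest entry, or store it as a new entry."""
        if self.features:
            sims = [cosine(feature, f) for f in self.features]
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] >= self.merge_threshold:
                w = self.merge_weight
                self.features[best] = [
                    w * old + (1 - w) * new
                    for old, new in zip(self.features[best], feature)
                ]
                self.hits[best] += 1
                return
        # No close match: make room first, then append the new feature.
        if len(self.features) >= self.capacity:
            self.discard_obsolete()
        self.features.append(list(feature))
        self.hits.append(1)

    def discard_obsolete(self):
        """Evict the least frequently matched entry (an assumed policy)."""
        victim = min(range(len(self.hits)), key=self.hits.__getitem__)
        del self.features[victim]
        del self.hits[victim]


bank = FeatureBank(capacity=2)
for vec in ([1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [-1.0, 0.0]):
    bank.absorb(vec)
print(len(bank.features), bank.hits)
```

In this sketch the bank never grows past `capacity`, which is the property the abstract motivates: memory stays bounded on long videos while frequently matched features are retained.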
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 83.1 | AFB-URR |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 82.6 | AFB-URR |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 78.8 | AFB-URR |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 74.1 | AFB-URR |
| Video | YouTube-VOS 2018 | Mean Jaccard & F-Measure | 79.6 | AFB-URR |
| Video | DAVIS 2017 (val) | F-measure | 76.1 | AFB-URR |
| Video | DAVIS 2017 (val) | Jaccard | 73 | AFB-URR |
| Video | DAVIS 2017 (val) | Mean Jaccard & F-Measure | 74.6 | AFB-URR |
| Video | DAVIS 2017 (val) | F-measure (Decay) | 15.5 | AFB-URR |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 76.1 | AFB-URR |
| Video | DAVIS 2017 (val) | F-measure (Recall) | 87 | AFB-URR |
| Video | DAVIS 2017 (val) | J&F | 74.6 | AFB-URR |
| Video | DAVIS 2017 (val) | Jaccard (Decay) | 13.8 | AFB-URR |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 73 | AFB-URR |
| Video | DAVIS 2017 (val) | Jaccard (Recall) | 85.3 | AFB-URR |
| Video | Long Video Dataset (3X) | F | 84.6 | AFB-URR |
| Video | Long Video Dataset (3X) | J | 82.9 | AFB-URR |
| Video | Long Video Dataset (3X) | J&F | 83.8 | AFB-URR |
| Video | Long Video Dataset | F | 84.5 | AFB-URR |
| Video | Long Video Dataset | J | 82.9 | AFB-URR |
| Video | Long Video Dataset | J&F | 83.7 | AFB-URR |
| Video | DAVIS (no YouTube-VOS training) | D17 val (F) | 76.1 | AFB-URR |
| Video | DAVIS (no YouTube-VOS training) | D17 val (G) | 74.6 | AFB-URR |
| Video | DAVIS (no YouTube-VOS training) | D17 val (J) | 73 | AFB-URR |
| Video | DAVIS (no YouTube-VOS training) | FPS | 4 | AFB-URR |
| Video | YouTube-VOS 2018 | Overall | 79.6 | AFB-URR |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 83.1 | AFB-URR |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 82.6 | AFB-URR |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 78.8 | AFB-URR |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 74.1 | AFB-URR |
| Video Object Segmentation | YouTube-VOS 2018 | Mean Jaccard & F-Measure | 79.6 | AFB-URR |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure | 76.1 | AFB-URR |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard | 73 | AFB-URR |
| Video Object Segmentation | DAVIS 2017 (val) | Mean Jaccard & F-Measure | 74.6 | AFB-URR |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Decay) | 15.5 | AFB-URR |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 76.1 | AFB-URR |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Recall) | 87 | AFB-URR |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 74.6 | AFB-URR |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Decay) | 13.8 | AFB-URR |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 73 | AFB-URR |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Recall) | 85.3 | AFB-URR |
| Video Object Segmentation | Long Video Dataset (3X) | F | 84.6 | AFB-URR |
| Video Object Segmentation | Long Video Dataset (3X) | J | 82.9 | AFB-URR |
| Video Object Segmentation | Long Video Dataset (3X) | J&F | 83.8 | AFB-URR |
| Video Object Segmentation | Long Video Dataset | F | 84.5 | AFB-URR |
| Video Object Segmentation | Long Video Dataset | J | 82.9 | AFB-URR |
| Video Object Segmentation | Long Video Dataset | J&F | 83.7 | AFB-URR |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 76.1 | AFB-URR |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 74.6 | AFB-URR |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 73 | AFB-URR |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 4 | AFB-URR |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 79.6 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Decay) | 15.5 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 76.1 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Recall) | 87 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 74.6 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Decay) | 13.8 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 73 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Recall) | 85.3 | AFB-URR |
| Semi-Supervised Video Object Segmentation | Long Video Dataset (3X) | F | 84.6 | AFB-URR |
| Semi-Supervised Video Object Segmentation | Long Video Dataset (3X) | J | 82.9 | AFB-URR |
| Semi-Supervised Video Object Segmentation | Long Video Dataset (3X) | J&F | 83.8 | AFB-URR |
| Semi-Supervised Video Object Segmentation | Long Video Dataset | F | 84.5 | AFB-URR |
| Semi-Supervised Video Object Segmentation | Long Video Dataset | J | 82.9 | AFB-URR |
| Semi-Supervised Video Object Segmentation | Long Video Dataset | J&F | 83.7 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 76.1 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 74.6 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 73 | AFB-URR |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 4 | AFB-URR |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 83.1 | AFB-URR |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 82.6 | AFB-URR |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 78.8 | AFB-URR |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 74.1 | AFB-URR |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 79.6 | AFB-URR |
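The aggregate rows in the table can be sanity-checked against their components. The sketch below assumes the usual benchmark conventions: J&F on DAVIS and the Long Video Dataset is the mean of J and F, and the YouTube-VOS overall score is the mean of the four seen/unseen J and F values. Because the table reports values rounded to one decimal, the check uses a tolerance of 0.1; the helper names `mean_score` and `consistent` are hypothetical.

```python
def mean_score(*components):
    """Arithmetic mean of the component metrics."""
    return sum(components) / len(components)


def consistent(reported, *components, tol=0.1):
    """True if the reported aggregate matches the component mean within `tol`.

    The tolerance absorbs rounding: the table lists one-decimal values,
    while the aggregates were presumably computed from unrounded scores.
    """
    return abs(mean_score(*components) - reported) <= tol


# (label, reported aggregate, components), all values copied from the table.
rows = [
    ("DAVIS 2017 (val) J&F", 74.6, (73.0, 76.1)),
    ("Long Video Dataset J&F", 83.7, (82.9, 84.5)),
    ("Long Video Dataset (3X) J&F", 83.8, (82.9, 84.6)),
    ("YouTube-VOS 2018 Overall", 79.6, (78.8, 74.1, 83.1, 82.6)),
]

for label, reported, components in rows:
    print(label, mean_score(*components), consistent(reported, *components))
```

Every aggregate in the table agrees with the mean of its listed components to within the rounding tolerance.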