Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Mubarak Shah, Ajmal Mian
Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features of frames with automatically generated high-quality reference masks are propagated to segment the remaining frames based on multi-granularity association to achieve temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1% J&F) and Ref-DAVIS17 (65.6% J&F). The code is available at https://github.com/bo-miao/HTR.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MeViS | F | 45.5 | HTR |
| Video | MeViS | J | 39.9 | HTR |
| Video | MeViS | J&F | 42.7 | HTR |
| Video | Refer-YouTube-VOS | F | 68.9 | HTR |
| Video | Refer-YouTube-VOS | J | 65.3 | HTR |
| Video | Refer-YouTube-VOS | J&F | 67.1 | HTR |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 68.9 | HTR (Pre-training) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 65.3 | HTR (Pre-training) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 67.1 | HTR (Pre-training) |
| Instance Segmentation | DAVIS 2017 (val) | J&F 1st frame | 65.6 | HTR |
| Video Object Segmentation | MeViS | F | 45.5 | HTR |
| Video Object Segmentation | MeViS | J | 39.9 | HTR |
| Video Object Segmentation | MeViS | J&F | 42.7 | HTR |
| Video Object Segmentation | Refer-YouTube-VOS | F | 68.9 | HTR |
| Video Object Segmentation | Refer-YouTube-VOS | J | 65.3 | HTR |
| Video Object Segmentation | Refer-YouTube-VOS | J&F | 67.1 | HTR |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 68.9 | HTR (Pre-training) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 65.3 | HTR (Pre-training) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 67.1 | HTR (Pre-training) |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 65.6 | HTR |
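
For readers unfamiliar with the metrics in the table, the sketch below illustrates how the columns relate under the standard video object segmentation convention: J is region similarity (the Jaccard index, or intersection over union, between predicted and ground-truth masks), F is contour accuracy, and J&F is the arithmetic mean of the two. The function names `region_similarity` and `j_and_f` are hypothetical and are not taken from the HTR repository; the contour term F is not implemented here, and the Mask Consistency Score proposed in the paper is not reproduced because its definition is not given in this summary.

```python
import numpy as np


def region_similarity(pred, gt):
    """Jaccard index (J) between two binary masks of the same shape.

    Assumption: when both masks are empty the score is taken to be 1.0,
    a common convention in VOS evaluation code.
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)


def j_and_f(j, f):
    """J&F is the arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


# Toy masks: the prediction covers two pixels, the ground truth one of them.
pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
toy_j = region_similarity(pred_mask, gt_mask)  # intersection 1, union 2

# Consistency check against the J and F values reported in the table.
ref_ytvos = j_and_f(65.3, 68.9)
mevis = j_and_f(39.9, 45.5)
```

Averaging the reported J and F values this way gives the J&F figures listed for Refer-YouTube-VOS (67.1) and MeViS (42.7), up to rounding.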