Temporally Consistent Referring Video Object Segmentation with Hybrid Memory

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Mubarak Shah, Ajmal Mian

2024-03-28

Referring Video Object Segmentation, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Video Segmentation, Video Object Segmentation, Video Semantic Segmentation, HTR


Abstract

Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features of frames with automatically generated high-quality reference masks are propagated to segment the remaining frames based on multi-granularity association to achieve temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.

Results

Task	Dataset	Metric	Value	Model
Video	MeViS	F	45.5	HTR
Video	MeViS	J	39.9	HTR
Video	MeViS	J&F	42.7	HTR
Video	Refer-YouTube-VOS	F	68.9	HTR
Video	Refer-YouTube-VOS	J	65.3	HTR
Video	Refer-YouTube-VOS	J&F	67.1	HTR
Instance Segmentation	Refer-YouTube-VOS (2021 public validation)	F	68.9	HTR (Pre-training)
Instance Segmentation	Refer-YouTube-VOS (2021 public validation)	J	65.3	HTR (Pre-training)
Instance Segmentation	Refer-YouTube-VOS (2021 public validation)	J&F	67.1	HTR (Pre-training)
Instance Segmentation	DAVIS 2017 (val)	J&F 1st frame	65.6	HTR
Video Object Segmentation	MeViS	F	45.5	HTR
Video Object Segmentation	MeViS	J	39.9	HTR
Video Object Segmentation	MeViS	J&F	42.7	HTR
Video Object Segmentation	Refer-YouTube-VOS	F	68.9	HTR
Video Object Segmentation	Refer-YouTube-VOS	J	65.3	HTR
Video Object Segmentation	Refer-YouTube-VOS	J&F	67.1	HTR
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	F	68.9	HTR (Pre-training)
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	J	65.3	HTR (Pre-training)
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	J&F	67.1	HTR (Pre-training)
Referring Expression Segmentation	DAVIS 2017 (val)	J&F 1st frame	65.6	HTR
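
The J&F rows in the table are consistent with the usual convention that J&F is the arithmetic mean of region similarity J (Jaccard index, i.e. mask IoU) and contour accuracy F. The following is a minimal sketch of that arithmetic, not the authors' evaluation code: the function names `jaccard` and `j_and_f` are hypothetical, masks are assumed to be nested lists of 0/1 values, scoring two empty masks as 1.0 is an assumed convention, and F (which requires boundary matching) is taken as a given number rather than computed.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    Masks are nested lists of 0/1 values with identical shapes (assumption).
    """
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Toy 2x3 masks: 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(jaccard(pred, gt))  # 0.5

# Reproduce the J&F values reported in the table from their J and F rows.
print(round(j_and_f(65.3, 68.9), 1))  # 67.1 (Refer-YouTube-VOS)
print(round(j_and_f(39.9, 45.5), 1))  # 42.7 (MeViS)
```

In benchmark evaluation, J and F are averaged over frames and objects before being combined, so the toy masks only illustrate the per-frame J computation.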
