Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus

Xiang Li, Jinglu Wang, Xiaohao Xu, Xiao Li, Bhiksha Raj, Yan Lu

2022-07-04
Tasks: Referring Video Object Segmentation, Referring Expression Segmentation, Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation
Paper · PDF · Code (official)

Abstract

Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression. Most existing R-VOS methods make a critical assumption: the object referred to must appear in the video. This assumption, which we refer to as semantic consensus, is often violated in real-world scenarios, where the expression may be queried against false videos. In this work, we highlight the need for a robust R-VOS model that can handle semantic mismatches. Accordingly, we propose an extended task called Robust R-VOS, which accepts unpaired video-text inputs. We tackle this problem by jointly modeling the primary R-VOS problem and its dual problem (text reconstruction). A structural text-to-text cycle constraint is introduced to discriminate semantic consensus between video-text pairs and to impose it on positive pairs, thereby achieving multi-modal alignment from both positive and negative pairs. Our structural constraint effectively addresses the challenge posed by linguistic diversity, overcoming the limitations of previous methods that relied on point-wise constraints. A new evaluation dataset, R²-Youtube-VOS, is constructed to measure model robustness. Our model achieves state-of-the-art performance on the R-VOS benchmarks Ref-DAVIS17 and Ref-Youtube-VOS, as well as on our R²-Youtube-VOS dataset.
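The structural text-to-text cycle constraint can be pictured as follows: rather than forcing each reconstructed expression embedding to match its source expression point-wise, the model compares the *relations* (pairwise similarities) within a batch of original text embeddings against the relations within the reconstructed ones. The sketch below is a minimal, hypothetical illustration of that idea in NumPy, not the paper's actual implementation; the embedding dimensions, the cosine-based relation matrix, and the mean-squared loss are all assumptions made for clarity.

```python
import numpy as np

def relation_matrix(embs):
    """Pairwise cosine similarities within a batch of embeddings (B, D) -> (B, B)."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return normed @ normed.T

def structural_cycle_loss(text_embs, recon_embs):
    """Penalize mismatch between the relational structure of the original
    expressions and that of their reconstructions, instead of forcing each
    reconstruction to match its source point-wise (a hypothetical stand-in
    for the paper's structural constraint)."""
    diff = relation_matrix(text_embs) - relation_matrix(recon_embs)
    return float(np.mean(diff ** 2))

# Toy check: the loss is invariant to batch-wide rescaling (a crude stand-in
# for benign linguistic variation) but reacts to a structurally different batch.
rng = np.random.default_rng(0)
texts = rng.normal(size=(4, 32))
assert structural_cycle_loss(texts, 2.0 * texts) < 1e-12
assert structural_cycle_loss(texts, rng.normal(size=(4, 32))) > 0.0
```

A point-wise constraint would instead compare `text_embs[i]` with `recon_embs[i]` directly, which the abstract argues is brittle under linguistic diversity; the relational form only requires the batch's similarity structure to be preserved.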

Results

Task                             | Dataset                                    | Metric | Value | Model
Video                            | Refer-YouTube-VOS                          | F      | 61.5  | R2VOS (Swin-T)
Video                            | Refer-YouTube-VOS                          | J      | 58.9  | R2VOS (Swin-T)
Video                            | Refer-YouTube-VOS                          | J&F    | 60.2  | R2VOS (Swin-T)
Instance Segmentation            | Refer-YouTube-VOS (2021 public validation) | F      | 63.1  | R2VOS (Video-Swin-T)
Instance Segmentation            | Refer-YouTube-VOS (2021 public validation) | J      | 59.6  | R2VOS (Video-Swin-T)
Instance Segmentation            | Refer-YouTube-VOS (2021 public validation) | J&F    | 61.3  | R2VOS (Video-Swin-T)
Video Object Segmentation        | Refer-YouTube-VOS                          | F      | 61.5  | R2VOS (Swin-T)
Video Object Segmentation        | Refer-YouTube-VOS                          | J      | 58.9  | R2VOS (Swin-T)
Video Object Segmentation        | Refer-YouTube-VOS                          | J&F    | 60.2  | R2VOS (Swin-T)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F     | 63.1  | R2VOS (Video-Swin-T)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J     | 59.6  | R2VOS (Video-Swin-T)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F   | 61.3  | R2VOS (Video-Swin-T)
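For readers unfamiliar with the metrics: J is region similarity (intersection-over-union between predicted and ground-truth masks), F is contour accuracy (a boundary F-measure, whose exact computation involves boundary matching and is omitted here), and J&F is their arithmetic mean. The short sketch below illustrates J and the J&F average; it is a simplified illustration, not the official evaluation code.

```python
import numpy as np

def region_similarity_j(pred_mask, gt_mask):
    """J: intersection-over-union between binary predicted and ground-truth masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: treat as perfect agreement
        return 1.0
    return np.logical_and(pred, gt).sum() / union

def j_and_f(j, f):
    """The headline benchmark score: arithmetic mean of J and F."""
    return (j + f) / 2.0

# Sanity check against the Refer-YouTube-VOS rows above: J=58.9, F=61.5.
print(round(j_and_f(58.9, 61.5), 1))  # 60.2
```

The same averaging reproduces the other J&F entries in the table, e.g. (59.6 + 63.1) / 2 ≈ 61.3 for the 2021 public validation split.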

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
- A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
- SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
- Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
- U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV (2025-07-15)