Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

Shuting He, Henghui Ding

2024-04-04 · CVPR 2024

Tasks: Referring Expression · Referring Video Object Segmentation · Referring Expression Segmentation · Sentence Embeddings · Video Segmentation · Contrastive Learning · Video Semantic Segmentation

Paper · PDF · Code (official)

Abstract

Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a whole and directly perform identification at the video level, mixing up static image-level cues with temporal motion cues. However, image-level features cannot well comprehend motion cues in sentences, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. Firstly, we introduce an expression-decoupling module to make static cues and motion cues perform their distinct roles, alleviating the issue of sentence embeddings overlooking motion cues. Secondly, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable $\textbf{9.2\%}$ $\mathcal{J\&F}$ improvement on the challenging $\textbf{MeViS}$ dataset. Code is available at https://github.com/heshuting555/DsHmp.
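The abstract mentions using contrastive learning to distinguish the motions of visually similar objects. As a rough illustration only, a generic InfoNCE-style contrastive loss over motion embeddings might look like the sketch below; this is not the paper's exact formulation, and all names and shapes here are illustrative assumptions.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.07):
    """Generic InfoNCE contrastive loss (illustrative sketch):
    pull the anchor embedding toward its positive (the matching motion)
    and push it away from negatives (motions of other, similar objects).

    anchor:    (d,)   embedding of the referred object's motion
    positive:  (d,)   embedding of the matching motion description
    negatives: (k, d) embeddings of distractor motions
    """
    def l2norm(v):
        # Cosine similarity via L2-normalised dot products
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, n = l2norm(anchor), l2norm(positive), l2norm(negatives)
    pos = np.exp(a @ p / temperature)        # similarity to the true match
    neg = np.exp(n @ a / temperature).sum()  # similarities to distractors
    return -np.log(pos / (pos + neg))        # small when anchor ~ positive
```

When the anchor is close to its positive and far from the negatives, the loss approaches zero; confusable motions produce a large loss, which is the training signal that separates them.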

Results

Task | Dataset | Metric | Value | Model
Video | MeViS | F | 49.8 | DsHmp
Video | MeViS | J | 43 | DsHmp
Video | MeViS | J&F | 46.4 | DsHmp
Video | Ref-DAVIS17 | F | 68.1 | DsHmp
Video | Ref-DAVIS17 | J | 61.7 | DsHmp
Video | Ref-DAVIS17 | J&F | 64.9 | DsHmp
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 69.1 | DsHmp (Video-Swin-Base)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 65 | DsHmp (Video-Swin-Base)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 67.1 | DsHmp (Video-Swin-Base)
Video Object Segmentation | MeViS | F | 49.8 | DsHmp
Video Object Segmentation | MeViS | J | 43 | DsHmp
Video Object Segmentation | MeViS | J&F | 46.4 | DsHmp
Video Object Segmentation | Ref-DAVIS17 | F | 68.1 | DsHmp
Video Object Segmentation | Ref-DAVIS17 | J | 61.7 | DsHmp
Video Object Segmentation | Ref-DAVIS17 | J&F | 64.9 | DsHmp
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 69.1 | DsHmp (Video-Swin-Base)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 65 | DsHmp (Video-Swin-Base)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 67.1 | DsHmp (Video-Swin-Base)
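The J, F, and J&F numbers above are the standard video object segmentation metrics: J is region similarity (mask IoU), F is boundary accuracy, and J&F is their mean. A minimal sketch of how they can be computed per frame is shown below; this is a simplified stand-in for the official DAVIS evaluation (the real boundary measure uses morphological matching, and this version's dilation wraps at image edges), so treat it as illustrative only.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection-over-union between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement
    return np.logical_and(pred, gt).sum() / union

def _boundary(mask):
    """Boundary pixels: mask pixels with at least one background 4-neighbour."""
    m = np.pad(mask.astype(bool), 1, constant_values=False)
    interior = (m[:-2, 1:-1] & m[2:, 1:-1] & m[1:-1, :-2] & m[1:-1, 2:])
    return mask.astype(bool) & ~interior

def _dilate(mask, r):
    """Crude binary dilation by a (2r+1)x(2r+1) square (tolerance band)."""
    m = mask.astype(bool)
    out = m.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= np.roll(np.roll(m, dy, axis=0), dx, axis=1)
    return out

def jf_score(pred, gt, tol=1):
    """Return (J, F, J&F) for one frame; `tol` is the boundary match radius."""
    j = region_similarity(pred, gt)
    bp, bg = _boundary(pred), _boundary(gt)
    precision = (bp & _dilate(bg, tol)).sum() / max(bp.sum(), 1)
    recall = (bg & _dilate(bp, tol)).sum() / max(bg.sum(), 1)
    f = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return j, f, (j + f) / 2
```

A dataset-level J&F, such as the 46.4 reported on MeViS, is the average of these per-frame (and per-object) scores over the whole benchmark.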

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment (2025-07-20)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
LLM-Driven Dual-Level Multi-Interest Modeling for Recommendation (2025-07-15)