Guang Feng, Lihe Zhang, Zhiwei Hu, Huchuan Lu
Referring video segmentation aims to segment the video object described by a language expression. To address this task, we first design a two-stream encoder that hierarchically extracts CNN-based visual features and transformer-based linguistic features; a vision-language mutual guidance (VLMG) module is inserted into the encoder at multiple stages to promote the hierarchical and progressive fusion of multi-modal features. Compared with existing multi-modal fusion methods, this two-stream encoder takes into account the multi-granularity linguistic context and, with the help of VLMG, realizes deep interleaving between the modalities. To promote temporal alignment between frames, we further propose a language-guided multi-scale dynamic filtering (LMDF) module to strengthen temporal coherence: it uses language-guided spatio-temporal features to generate a set of position-specific dynamic filters that update the features of the current frame more flexibly and effectively.  Extensive experiments on four datasets verify the effectiveness of the proposed model.
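The position-specific dynamic filtering idea can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; it only shows the general mechanism the abstract describes, under the assumption that a language-guided feature map predicts a small k×k kernel at every spatial location, which is then applied to the current frame's features. The class name `DynamicFilterSketch` and all layer choices are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterSketch(nn.Module):
    """Hypothetical sketch of position-specific dynamic filtering.

    A guidance feature map (e.g. language-guided spatio-temporal features)
    predicts one k*k filter per spatial position; that filter is applied to
    the corresponding local neighborhood of the current-frame features.
    """
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.k = k
        # Predict a k*k kernel at every position, shared across channels.
        self.filter_gen = nn.Conv2d(channels, k * k, kernel_size=1)

    def forward(self, cur_feat: torch.Tensor, guide_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = cur_feat.shape
        # (B, k*k, H, W): one dynamic kernel per spatial location.
        filters = self.filter_gen(guide_feat)
        filters = F.softmax(filters, dim=1)  # normalize each local kernel
        # Extract k*k neighborhoods of the current-frame features.
        patches = F.unfold(cur_feat, self.k, padding=self.k // 2)  # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        filters = filters.view(b, 1, self.k * self.k, h * w)
        # Weighted sum over each neighborhood with its position-specific filter.
        out = (patches * filters).sum(dim=2).view(b, c, h, w)
        return out
```

In this sketch the predicted filters are shared across channels; a multi-scale variant would run several such branches with different `k` and fuse the outputs.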
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 50.67 | VLIDE |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 48.44 | VLIDE |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 49.56 | VLIDE |
| Instance Segmentation | A2D Sentences | AP | 0.469 | VLIDE |
| Instance Segmentation | A2D Sentences | IoU mean | 0.598 | VLIDE |
| Instance Segmentation | A2D Sentences | IoU overall | 0.714 | VLIDE |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.702 | VLIDE |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.663 | VLIDE |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.585 | VLIDE |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.428 | VLIDE |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.151 | VLIDE |
| Instance Segmentation | J-HMDB | AP | 0.441 | VLIDE |
| Instance Segmentation | J-HMDB | IoU mean | 0.666 | VLIDE |
| Instance Segmentation | J-HMDB | IoU overall | 0.68 | VLIDE |
| Instance Segmentation | J-HMDB | Precision@0.5 | 0.874 | VLIDE |
| Instance Segmentation | J-HMDB | Precision@0.6 | 0.791 | VLIDE |
| Instance Segmentation | J-HMDB | Precision@0.7 | 0.586 | VLIDE |
| Instance Segmentation | J-HMDB | Precision@0.8 | 0.182 | VLIDE |
| Instance Segmentation | J-HMDB | Precision@0.9 | 0.3 | VLIDE |