RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation

Miriam Bellver, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, Xavier Giro-i-Nieto

2020-10-01 | Referring Expression Segmentation | Segmentation | Video Object Segmentation | Image Segmentation

Abstract

The task of video object segmentation with referring expressions (language-guided VOS) is, given a linguistic phrase and a video, to generate binary masks for the object to which the phrase refers. Our work argues that existing benchmarks used for this task are mainly composed of trivial cases, in which referents can be identified with simple phrases. Our analysis relies on a new categorization of the phrases in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial referring expressions (REs), with the non-trivial REs annotated with seven RE semantic categories. We leverage this data to analyze the results of RefVOS, a novel neural network that obtains competitive results for the task of language-guided image segmentation and state-of-the-art results for language-guided VOS. Our study indicates that the major challenges for the task are related to understanding motion and static actions.

Results

Task	Dataset	Metric	Value	Model
Instance Segmentation	RefCOCO val	Overall IoU	59.45	RefVOS with BERT + MLM loss
Instance Segmentation	RefCOCO val	Overall IoU	58.65	RefVOS with BERT Pre-train
Instance Segmentation	A2D Sentences	IoU mean	0.599	RefVOS
Instance Segmentation	A2D Sentences	IoU overall	0.599	RefVOS
Instance Segmentation	A2D Sentences	Precision@0.5	0.495	RefVOS
Instance Segmentation	A2D Sentences	Precision@0.9	0.064	RefVOS
Instance Segmentation	RefCOCO+ val	Overall IoU	44.71	RefVOS with BERT + MLM loss
Instance Segmentation	A2Dre test	Mean IoU	33.2	RefVOS
Instance Segmentation	A2Dre test	Overall IoU	47.5	RefVOS
Instance Segmentation	RefCOCO+ testB	Overall IoU	36.17	RefVOS with BERT + MLM loss
Instance Segmentation	DAVIS 2017 (val)	J&F 1st frame	45.1	RefVOS
Instance Segmentation	DAVIS 2017 (val)	J&F 1st frame	44.5	RefVOS
Instance Segmentation	DAVIS 2017 (val)	J&F Full video	45.1	RefVOS
Instance Segmentation	RefCOCO+ testA	Overall IoU	49.73	RefVOS with BERT + MLM loss
Referring Expression Segmentation	RefCOCO val	Overall IoU	59.45	RefVOS with BERT + MLM loss
Referring Expression Segmentation	RefCOCO val	Overall IoU	58.65	RefVOS with BERT Pre-train
Referring Expression Segmentation	A2D Sentences	IoU mean	0.599	RefVOS
Referring Expression Segmentation	A2D Sentences	IoU overall	0.599	RefVOS
Referring Expression Segmentation	A2D Sentences	Precision@0.5	0.495	RefVOS
Referring Expression Segmentation	A2D Sentences	Precision@0.9	0.064	RefVOS
Referring Expression Segmentation	RefCOCO+ val	Overall IoU	44.71	RefVOS with BERT + MLM loss
Referring Expression Segmentation	A2Dre test	Mean IoU	33.2	RefVOS
Referring Expression Segmentation	A2Dre test	Overall IoU	47.5	RefVOS
Referring Expression Segmentation	RefCOCO+ testB	Overall IoU	36.17	RefVOS with BERT + MLM loss
Referring Expression Segmentation	DAVIS 2017 (val)	J&F 1st frame	45.1	RefVOS
Referring Expression Segmentation	DAVIS 2017 (val)	J&F 1st frame	44.5	RefVOS
Referring Expression Segmentation	DAVIS 2017 (val)	J&F Full video	45.1	RefVOS
Referring Expression Segmentation	RefCOCO+ testA	Overall IoU	49.73	RefVOS with BERT + MLM loss
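
The table reports Overall IoU, Mean IoU and Precision@K without defining them. The sketch below shows how these are conventionally computed in the referring segmentation literature. It is an assumption that the figures above follow these exact definitions, since the page does not spell them out. The function names (`intersection_and_union`, `segmentation_metrics`) and the nested-list mask representation are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def segmentation_metrics(
    preds: Sequence[Mask],
    targets: Sequence[Mask],
    thresholds: Sequence[float] = (0.5, 0.9),
) -> Dict[str, float]:
    """Compute overall IoU, mean IoU and Precision@K over a set of samples."""
    if len(preds) == 0 or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equally long")

    inters: List[int] = []
    unions: List[int] = []
    ious: List[float] = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        inters.append(inter)
        unions.append(union)
        # Assumed convention: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)

    total_union = sum(unions)
    metrics = {
        # Overall IoU: total intersection over total union across all samples,
        # so large objects weigh more than small ones.
        "overall_iou": sum(inters) / total_union if total_union else 1.0,
        # Mean IoU: every sample contributes equally.
        "mean_iou": sum(ious) / len(ious),
    }
    for k in thresholds:
        # Precision@K: fraction of samples whose IoU exceeds the threshold K.
        metrics[f"precision@{k}"] = sum(iou > k for iou in ious) / len(ious)
    return metrics


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    targets = [[[1, 1], [1, 1]], [[1, 0], [0, 0]]]
    print(segmentation_metrics(preds, targets))
```

The sketch returns fractions in [0, 1], the scale used by the A2D Sentences rows, whereas the RefCOCO, A2Dre and DAVIS rows are given as percentages. The J&F measure reported for DAVIS 2017 averages region similarity (J, an IoU) with contour accuracy (F) and is not covered here.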
