SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation

Ioannis Kazakos, Carles Ventura, Miriam Bellver, Carina Silberer, Xavier Giro-i-Nieto

2021-06-08

Referring Expression Segmentation, Segmentation, Video Object Segmentation, object-detection

Abstract

Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation. However, collecting large datasets for these tasks is expensive in annotation time, which creates a bottleneck. To address this, we propose SynthRef, a novel method for generating synthetic referring expressions for target objects in an image (or video frame). We also present and release the first large-scale dataset with synthetic referring expressions for video object segmentation. Our experiments demonstrate that training with our synthetic referring expressions improves a model's ability to generalize across different datasets at no additional annotation cost. Moreover, the formulation can be applied to any object detection or segmentation dataset.
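To make the idea of deriving a referring expression from existing annotations concrete, here is a toy sketch. It is not the SynthRef method: the abstract does not say which cues SynthRef uses, so the single cue below (horizontal position among objects of the same class) is an assumption for illustration, and the `Box` type and `synthetic_expression` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical detection annotation: class label plus horizontal extent."""
    label: str
    x_min: float
    x_max: float


def synthetic_expression(target_index: int, boxes: List[Box]) -> str:
    """Build a simple expression for one object from annotations alone.

    Assumed cue (not taken from the paper): the class name, disambiguated
    by horizontal position when several objects share the class.
    """
    target = boxes[target_index]
    same_class = [b for b in boxes if b.label == target.label]
    if len(same_class) == 1:
        # The class name alone identifies the object.
        return f"the {target.label}"

    centers = sorted((b.x_min + b.x_max) / 2 for b in same_class)
    center = (target.x_min + target.x_max) / 2
    if center == centers[0]:
        return f"the {target.label} on the left"
    if center == centers[-1]:
        return f"the {target.label} on the right"
    # With more than three same-class objects this phrase is ambiguous;
    # a real system would need further cues.
    return f"the {target.label} in the middle"


boxes = [Box("person", 0, 10), Box("person", 50, 60), Box("dog", 20, 30)]
expressions = [synthetic_expression(i, boxes) for i in range(len(boxes))]
```

Because the sketch reads only class labels and box coordinates, it would run on any detection-style annotation file, which is the property the abstract highlights.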

Results

Task	Dataset	Metric	Value	Model
Instance Segmentation	DAVIS 2017 (val)	J&F 1st frame	45.3	RefVOS + SynthRef-YouTube-VIS
Instance Segmentation	DAVIS 2017 (val)	J&F Full video	44.8	RefVOS + SynthRef-YouTube-VIS
Instance Segmentation	Refer-YouTube-VOS	Mean IoU	39.5	RefVOS-Human REs
Instance Segmentation	Refer-YouTube-VOS	Precision@0.5	38.6	RefVOS-Human REs
Instance Segmentation	Refer-YouTube-VOS	Precision@0.9	6.9	RefVOS-Human REs
Instance Segmentation	Refer-YouTube-VOS	Mean IoU	35.0	RefVOS-Synthetic REs
Instance Segmentation	Refer-YouTube-VOS	Precision@0.5	32.3	RefVOS-Synthetic REs
Instance Segmentation	Refer-YouTube-VOS	Precision@0.9	1.8	RefVOS-Synthetic REs
Referring Expression Segmentation	DAVIS 2017 (val)	J&F 1st frame	45.3	RefVOS + SynthRef-YouTube-VIS
Referring Expression Segmentation	DAVIS 2017 (val)	J&F Full video	44.8	RefVOS + SynthRef-YouTube-VIS
Referring Expression Segmentation	Refer-YouTube-VOS	Mean IoU	39.5	RefVOS-Human REs
Referring Expression Segmentation	Refer-YouTube-VOS	Precision@0.5	38.6	RefVOS-Human REs
Referring Expression Segmentation	Refer-YouTube-VOS	Precision@0.9	6.9	RefVOS-Human REs
Referring Expression Segmentation	Refer-YouTube-VOS	Mean IoU	35.0	RefVOS-Synthetic REs
Referring Expression Segmentation	Refer-YouTube-VOS	Precision@0.5	32.3	RefVOS-Synthetic REs
Referring Expression Segmentation	Refer-YouTube-VOS	Precision@0.9	1.8	RefVOS-Synthetic REs
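The Mean IoU and Precision@K columns can be read with the following minimal sketch, which assumes the usual definitions: IoU is computed per sample between predicted and ground-truth binary masks, Mean IoU averages those values, and Precision@K is the fraction of samples whose IoU exceeds K. Whether the threshold comparison is strict, and how two empty masks are scored, are assumptions here; the page does not state the evaluation code used. The function names are illustrative.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(ious: Sequence[float]) -> float:
    """Average of per-sample IoU values."""
    return sum(ious) / len(ious)


def precision_at(ious: Sequence[float], threshold: float) -> float:
    """Fraction of samples whose IoU is strictly above the threshold."""
    return sum(1 for value in ious if value > threshold) / len(ious)


ground_truth = [[1, 1], [0, 0]]
predictions = [
    [[1, 1], [0, 0]],  # exact match
    [[1, 0], [0, 0]],  # covers half of the object
    [[0, 0], [1, 1]],  # misses the object entirely
]
ious = [iou(p, ground_truth) for p in predictions]
```

The functions return fractions in [0, 1]; multiplying by 100 gives values on the same scale as the table.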
