Ioannis Kazakos, Carles Ventura, Miriam Bellver, Carina Silberer, Xavier Giro-i-Nieto
Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation. However, collecting large datasets for these tasks is expensive in terms of annotation time, which creates a bottleneck. To address this, we propose SynthRef, a novel method for generating synthetic referring expressions for target objects in an image (or video frame), and we present and release the first large-scale dataset with synthetic referring expressions for video object segmentation. Our experiments demonstrate that training with our synthetic referring expressions improves a model's ability to generalize across different datasets, without any additional annotation cost. Moreover, our formulation can be applied to any object detection or segmentation dataset.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | DAVIS 2017 (val) | J&F 1st frame | 45.3 | RefVOS + SynthRef-YouTube-VIS |
| Instance Segmentation | DAVIS 2017 (val) | J&F Full video | 44.8 | RefVOS + SynthRef-YouTube-VIS |
| Instance Segmentation | Refer-YouTube-VOS | Mean IoU | 39.5 | RefVOS-Human REs |
| Instance Segmentation | Refer-YouTube-VOS | Precision@0.5 | 38.6 | RefVOS-Human REs |
| Instance Segmentation | Refer-YouTube-VOS | Precision@0.9 | 6.9 | RefVOS-Human REs |
| Instance Segmentation | Refer-YouTube-VOS | Mean IoU | 35 | RefVOS-Synthetic REs |
| Instance Segmentation | Refer-YouTube-VOS | Precision@0.5 | 32.3 | RefVOS-Synthetic REs |
| Instance Segmentation | Refer-YouTube-VOS | Precision@0.9 | 1.8 | RefVOS-Synthetic REs |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 45.3 | RefVOS + SynthRef-YouTube-VIS |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F Full video | 44.8 | RefVOS + SynthRef-YouTube-VIS |
| Referring Expression Segmentation | Refer-YouTube-VOS | Mean IoU | 39.5 | RefVOS-Human REs |
| Referring Expression Segmentation | Refer-YouTube-VOS | Precision@0.5 | 38.6 | RefVOS-Human REs |
| Referring Expression Segmentation | Refer-YouTube-VOS | Precision@0.9 | 6.9 | RefVOS-Human REs |
| Referring Expression Segmentation | Refer-YouTube-VOS | Mean IoU | 35 | RefVOS-Synthetic REs |
| Referring Expression Segmentation | Refer-YouTube-VOS | Precision@0.5 | 32.3 | RefVOS-Synthetic REs |
| Referring Expression Segmentation | Refer-YouTube-VOS | Precision@0.9 | 1.8 | RefVOS-Synthetic REs |
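
The Mean IoU and Precision@K columns above can be read with the following minimal sketch. It assumes the common definitions used in referring segmentation work: IoU is intersection over union between a predicted and a ground-truth binary mask, and Precision@K is the fraction of samples whose IoU exceeds K. The function names, the strict `>` comparison, and the handling of two empty masks are illustrative assumptions, not details taken from the paper. The J&F metric is not covered here.

```python
def iou(pred, gt):
    """Intersection over union of two binary masks (equal-sized nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(ious):
    """Average IoU over all evaluated samples."""
    return sum(ious) / len(ious)


def precision_at(ious, threshold):
    """Fraction of samples whose IoU is strictly above the threshold (assumed convention)."""
    return sum(1 for value in ious if value > threshold) / len(ious)


# Three toy 2x2 mask pairs: partial overlap, exact match, no overlap.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),
    ([[0, 0], [0, 1]], [[1, 0], [0, 0]]),
]
ious = [iou(pred, gt) for pred, gt in pairs]
```

The table reports these quantities as percentages, so a value returned by `mean_iou` or `precision_at` would be multiplied by 100 before comparison.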