Miriam Bellver, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, Xavier Giro-i-Nieto
The task of video object segmentation with referring expressions (language-guided VOS) is to generate binary masks for the object in a video to which a given linguistic phrase refers. We argue that existing benchmarks for this task consist mainly of trivial cases, in which referents can be identified with simple phrases. Our analysis relies on a new categorization of the phrases in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial referring expressions (REs), with the non-trivial REs annotated with seven RE semantic categories. We use this data to analyze the results of RefVOS, a novel neural network that obtains competitive results for language-guided image segmentation and state-of-the-art results for language-guided VOS. Our study indicates that the major challenges for the task are related to understanding motion and static actions.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | RefCOCO val | Overall IoU | 59.45 | RefVOS with BERT + MLM loss |
| Instance Segmentation | RefCOCO val | Overall IoU | 58.65 | RefVOS with BERT Pre-train |
| Instance Segmentation | A2D Sentences | IoU mean | 0.599 | RefVOS |
| Instance Segmentation | A2D Sentences | IoU overall | 0.599 | RefVOS |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.495 | RefVOS |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.064 | RefVOS |
| Instance Segmentation | RefCOCO+ val | Overall IoU | 44.71 | RefVOS with BERT + MLM loss |
| Instance Segmentation | A2Dre test | Mean IoU | 33.2 | RefVOS |
| Instance Segmentation | A2Dre test | Overall IoU | 47.5 | RefVOS |
| Instance Segmentation | RefCOCO+ testB | Overall IoU | 36.17 | RefVOS with BERT + MLM loss |
| Instance Segmentation | DAVIS 2017 (val) | J&F 1st frame | 45.1 | RefVOS |
| Instance Segmentation | DAVIS 2017 (val) | J&F 1st frame | 44.5 | RefVOS |
| Instance Segmentation | DAVIS 2017 (val) | J&F Full video | 45.1 | RefVOS |
| Instance Segmentation | RefCOCO+ testA | Overall IoU | 49.73 | RefVOS with BERT + MLM loss |
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 59.45 | RefVOS with BERT + MLM loss |
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 58.65 | RefVOS with BERT Pre-train |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.599 | RefVOS |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.599 | RefVOS |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.495 | RefVOS |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.064 | RefVOS |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 44.71 | RefVOS with BERT + MLM loss |
| Referring Expression Segmentation | A2Dre test | Mean IoU | 33.2 | RefVOS |
| Referring Expression Segmentation | A2Dre test | Overall IoU | 47.5 | RefVOS |
| Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 36.17 | RefVOS with BERT + MLM loss |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 45.1 | RefVOS |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 44.5 | RefVOS |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F Full video | 45.1 | RefVOS |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 49.73 | RefVOS with BERT + MLM loss |
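
The table reports Overall IoU, Mean IoU and Precision@K without defining them. The sketch below shows how these metrics are commonly computed for binary masks. It is an illustration under stated assumptions and not the authors' evaluation code. The function name `segmentation_metrics`, the choice to score an empty union as IoU 0, and the strict `>` comparison for Precision@K are all assumptions made here.

```python
import numpy as np

def segmentation_metrics(preds, gts, thresholds=(0.5, 0.9)):
    """Compute IoU-based metrics over paired boolean masks (hypothetical helper)."""
    # Per-sample intersection and union pixel counts.
    inter = np.array([np.logical_and(p, g).sum() for p, g in zip(preds, gts)], dtype=float)
    union = np.array([np.logical_or(p, g).sum() for p, g in zip(preds, gts)], dtype=float)
    # Per-sample IoU; assumption: a sample with an empty union scores 0.
    per_sample = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    total_union = union.sum()
    return {
        # Mean IoU: average of the per-sample IoUs.
        "mean_iou": float(per_sample.mean()),
        # Overall IoU: total intersection over total union, so large objects weigh more.
        "overall_iou": float(inter.sum() / total_union) if total_union > 0 else 0.0,
        # Precision@K: fraction of samples whose IoU exceeds K (assumed strict).
        "precision": {t: float((per_sample > t).mean()) for t in thresholds},
    }

# Toy data: one perfect prediction and one that only partly overlaps its target.
gt1 = np.zeros((4, 4), dtype=bool); gt1[:2, :2] = True
pred1 = gt1.copy()
gt2 = np.zeros((4, 4), dtype=bool); gt2[:2, :] = True
pred2 = np.zeros((4, 4), dtype=bool); pred2[1:3, :2] = True

metrics = segmentation_metrics([pred1, pred2], [gt1, gt2])
print(metrics)
```

Mean IoU and Overall IoU can differ noticeably on the same predictions, because Overall IoU pools pixels across samples while Mean IoU weighs every sample equally.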