Guang Feng, Lihe Zhang, Zhiwei Hu, Huchuan Lu
Referring video segmentation aims to segment the video object described by a language expression. To address this task, we first design a two-stream encoder that hierarchically extracts CNN-based visual features and transformer-based linguistic features; a vision-language mutual guidance (VLMG) module is inserted into the encoder at multiple stages to promote the hierarchical and progressive fusion of multi-modal features. Compared with existing multi-modal fusion methods, this two-stream encoder takes into account the multi-granularity linguistic context and, with the help of VLMG, realizes deep interleaving between the modalities. To promote temporal alignment between frames, we further propose a language-guided multi-scale dynamic filtering (LMDF) module to strengthen temporal coherence: it uses language-guided spatio-temporal features to generate a set of position-specific dynamic filters that update the features of the current frame more flexibly and effectively.  Extensive experiments on four datasets verify the effectiveness of the proposed model.
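The position-specific dynamic filtering idea can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; it only shows the general mechanism the abstract describes, under the assumption that a language-guided feature map predicts a small k×k kernel at every spatial location, which is then applied to the current frame's features. The class name `DynamicFilterSketch` and all layer choices are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterSketch(nn.Module):
    """Hypothetical sketch of position-specific dynamic filtering.

    A guidance feature map (e.g. language-guided spatio-temporal features)
    predicts one k*k filter per spatial position; that filter is applied to
    the corresponding local neighborhood of the current-frame features.
    """
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.k = k
        # Predict a k*k kernel at every position, shared across channels.
        self.filter_gen = nn.Conv2d(channels, k * k, kernel_size=1)

    def forward(self, cur_feat: torch.Tensor, guide_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = cur_feat.shape
        # (B, k*k, H, W): one dynamic kernel per spatial location.
        filters = self.filter_gen(guide_feat)
        filters = F.softmax(filters, dim=1)  # normalize each local kernel
        # Extract k*k neighborhoods of the current-frame features.
        patches = F.unfold(cur_feat, self.k, padding=self.k // 2)  # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        filters = filters.view(b, 1, self.k * self.k, h * w)
        # Weighted sum over each neighborhood with its position-specific filter.
        out = (patches * filters).sum(dim=2).view(b, c, h, w)
        return out
```

In this sketch the predicted filters are shared across channels; a multi-scale variant would run several such branches with different `k` and fuse the outputs.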
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 50.67 | VLIDE |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 48.44 | VLIDE |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 49.56 | VLIDE |
| Instance Segmentation | A2D Sentences | AP | 0.469 | VLIDE |
| Instance Segmentation | A2D Sentences | IoU mean | 0.598 | VLIDE |
| Instance Segmentation | A2D Sentences | IoU overall | 0.714 | VLIDE |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.702 | VLIDE |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.663 | VLIDE |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.585 | VLIDE |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.428 | VLIDE |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.151 | VLIDE |
| Instance Segmentation | J-HMDB | AP | 0.441 | VLIDE |
| Instance Segmentation | J-HMDB | IoU mean | 0.666 | VLIDE |
| Instance Segmentation | J-HMDB | IoU overall | 0.68 | VLIDE |
| Instance Segmentation | J-HMDB | Precision@0.5 | 0.874 | VLIDE |
| Instance Segmentation | J-HMDB | Precision@0.6 | 0.791 | VLIDE |
| Instance Segmentation | J-HMDB | Precision@0.7 | 0.586 | VLIDE |
| Instance Segmentation | J-HMDB | Precision@0.8 | 0.182 | VLIDE |
| Instance Segmentation | J-HMDB | Precision@0.9 | 0.3 | VLIDE |