Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction and segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and in understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct a well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all of them by a remarkable margin. In addition, the emphasis on temporal coherence improves segmentation stability and makes our method more adaptable to text expressions describing temporal variations. Code will be available.
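The two mechanisms sketched in the abstract, clustering frame-level object embeddings into a video-level representation and supervising it with multi-modal contrastive learning, follow a pattern that can be illustrated compactly. The PyTorch sketch below is a minimal illustration under our own assumptions, not the authors' implementation: the tensor names (`frame_obj_embs`, `text_embs`) and the `temperature` value are hypothetical, and mean pooling stands in for SOC's learned object cluster.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(frame_obj_embs, text_embs, temperature=0.07):
    """Sketch of video-level visual-linguistic contrastive alignment.

    frame_obj_embs: (B, T, N, D) frame-level object embeddings
        (B videos, T frames, N object queries, D channels).
    text_embs: (B, D) sentence-level embeddings of the paired expressions.

    NOT the SOC implementation; it only illustrates (1) aggregating
    per-frame object embeddings into a video-level representation and
    (2) supervising it with an InfoNCE-style loss against the text.
    """
    # (1) Aggregate over frames and object queries. SOC uses a learned
    # object cluster module; mean pooling is a stand-in here.
    video_embs = frame_obj_embs.mean(dim=(1, 2))          # (B, D)

    # Cosine-similarity logits between every video and every expression.
    video_embs = F.normalize(video_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = video_embs @ text_embs.t() / temperature     # (B, B)

    # (2) Symmetric InfoNCE: the matched video-text pairs on the
    # diagonal are positives, all other pairs in the batch negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

Treating in-batch mismatched pairs as negatives is what pulls the video-level visual and linguistic embeddings into a shared joint space, the video-level alignment the abstract emphasizes.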
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Object Segmentation | Refer-YouTube-VOS | F | 67.9 | SOC |
| Video Object Segmentation | Refer-YouTube-VOS | J | 64.1 | SOC |
| Video Object Segmentation | Refer-YouTube-VOS | J&F | 66.0 | SOC |
| Video Object Segmentation | Ref-DAVIS17 | F | 69.1 | SOC |
| Video Object Segmentation | Ref-DAVIS17 | J | 62.5 | SOC |
| Video Object Segmentation | Ref-DAVIS17 | J&F | 65.8 | SOC |
| Video Object Segmentation | Long-RVOS | J&F | 34.9 | SOC |
| Video Object Segmentation | Long-RVOS | tIoU | 68.1 | SOC |
| Video Object Segmentation | Long-RVOS | vIoU | 28.6 | SOC |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 69.3 | SOC (Joint training, Video-Swin-B) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 65.3 | SOC (Joint training, Video-Swin-B) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 60.5 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 57.8 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 59.2 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | A2D Sentences | AP | 0.573 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.725 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.807 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.851 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.827 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.765 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.607 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.252 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | AP | 0.504 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.669 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.747 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.790 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.756 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.687 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.535 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.195 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | J-HMDB | AP | 0.446 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | IoU mean | 0.723 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | IoU overall | 0.736 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | Precision@0.5 | 0.969 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | Precision@0.6 | 0.914 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | Precision@0.7 | 0.711 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | Precision@0.8 | 0.213 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | Precision@0.9 | 0.001 | SOC (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | AP | 0.397 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | J-HMDB | IoU mean | 0.701 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | J-HMDB | IoU overall | 0.707 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | J-HMDB | Precision@0.5 | 0.947 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | J-HMDB | Precision@0.6 | 0.864 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | J-HMDB | Precision@0.7 | 0.627 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | J-HMDB | Precision@0.8 | 0.179 | SOC (Video-Swin-T) |
| Referring Expression Segmentation | J-HMDB | Precision@0.9 | 0.001 | SOC (Video-Swin-T) |
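A note on the metrics above: J is region similarity (mask IoU between prediction and ground truth), F is contour accuracy (a boundary F-measure), and J&F is their average. On A2D Sentences and J-HMDB, "IoU overall" pools intersections and unions over the whole test set, "IoU mean" averages per-sample IoUs, and Precision@K is the fraction of samples whose IoU exceeds threshold K. On Long-RVOS, tIoU and vIoU are, to our understanding, temporal and spatio-temporal (volume) IoU variants. The sketch below computes the IoU-based quantities under these common definitions; it is illustrative, not the official evaluation code, and omits the boundary F-measure, which requires contour extraction and matching.

```python
import numpy as np

def mask_iou(pred, gt):
    """Region similarity J for one sample: IoU of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def dataset_metrics(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """'IoU mean', 'IoU overall', and Precision@K as commonly defined
    for A2D Sentences / J-HMDB style evaluation (illustrative only)."""
    inters = np.array([np.logical_and(p.astype(bool), g.astype(bool)).sum()
                       for p, g in zip(preds, gts)])
    unions = np.array([np.logical_or(p.astype(bool), g.astype(bool)).sum()
                       for p, g in zip(preds, gts)])
    # Per-sample IoUs; an empty union counts as a perfect match.
    ious = np.where(unions > 0, inters / np.maximum(unions, 1), 1.0)
    return {
        "IoU mean": float(ious.mean()),                     # average per-sample IoU
        "IoU overall": float(inters.sum() / unions.sum()),  # pooled over the set
        **{f"Precision@{t}": float((ious > t).mean()) for t in thresholds},
    }
```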