Wangbo Zhao, Kai Wang, Xiangxiang Chu, Fuzhao Xue, Xinchao Wang, Yang You
Text-based video segmentation aims to segment the target object in a video based on a descriptive sentence. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial yet has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation. Specifically, we propose a multi-modal video transformer, which can fuse and aggregate multi-modal and temporal features between frames. Furthermore, we design a language-guided feature fusion module to progressively fuse appearance and motion features at each feature level with guidance from linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences demonstrate the performance and generalization ability of our method compared with state-of-the-art methods.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | A2D Sentences | AP | 0.419 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | IoU mean | 0.558 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | IoU overall | 0.673 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.645 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.597 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.523 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.375 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.13 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | AP | 0.419 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.558 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.673 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.645 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.597 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.523 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.375 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.13 | mmmmtbvs |
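
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how IoU overall, IoU mean, and Precision@K are commonly computed for referring segmentation. It is not the authors' evaluation code. It assumes the usual definitions: overall IoU is the total intersection divided by the total union across all samples, mean IoU is the average of per-sample IoU, and Precision@K is the fraction of samples whose IoU exceeds K. The function names, the mask representation (nested lists of 0/1), and the handling of empty masks are illustrative assumptions. AP is not covered here.

```python
from typing import Dict, Sequence, Tuple

# Assumption: a mask is a 2D grid of 0/1 values (rows of pixels).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def evaluate(
    preds: Sequence[Mask],
    gts: Sequence[Mask],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[str, float]:
    """Compute IoU overall, IoU mean, and Precision@K over a set of samples."""
    if len(preds) == 0 or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")

    ious = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union > 0 else 1.0)

    n = len(ious)
    results = {
        # Accumulated intersection over accumulated union.
        "IoU overall": total_inter / total_union if total_union > 0 else 1.0,
        # Average of the per-sample IoU values.
        "IoU mean": sum(ious) / n,
    }
    for t in thresholds:
        # Fraction of samples whose IoU is strictly above the threshold.
        results[f"Precision@{t}"] = sum(1 for iou in ious if iou > t) / n
    return results
```

Because overall IoU pools pixels across samples, it is weighted toward large objects, while mean IoU treats every sample equally. This is one reason the two values in the table (0.673 and 0.558) can differ noticeably.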