
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

Wangbo Zhao, Kai Wang, Xiangxiang Chu, Fuzhao Xue, Xinchao Wang, Yang You

Published: 2022-04-06
Venue: CVPR 2022
Tasks: Optical Flow Estimation, Referring Expression Segmentation, Segmentation, Video Segmentation, Video Semantic Segmentation

Abstract

Text-based video segmentation aims to segment the target object in a video based on a describing sentence. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial yet has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation. Specifically, we propose a multi-modal video transformer, which can fuse and aggregate multi-modal and temporal features between frames. Furthermore, we design a language-guided feature fusion module to progressively fuse appearance and motion features in each feature level with guidance from linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences verify the performance and the generalization ability of our method compared to the state-of-the-art methods.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | A2D Sentences | AP | 0.419 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | IoU mean | 0.558 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | IoU overall | 0.673 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.645 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.597 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.523 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.375 | mmmmtbvs |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.13 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | AP | 0.419 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.558 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.673 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.645 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.597 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.523 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.375 | mmmmtbvs |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.13 | mmmmtbvs |
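
The IoU and Precision@K figures above follow metric definitions that are common for this benchmark, and a small sketch may help when reading them. The code below is an illustration under stated assumptions, not the authors' evaluation script: it assumes "IoU overall" is the total intersection divided by the total union across all samples, "IoU mean" is the average of per-sample IoU, and Precision@K is the fraction of samples whose IoU is strictly greater than K. The function names (`intersection_and_union`, `evaluate`) and the convention that two empty masks score an IoU of 1.0 are hypothetical choices made for this example. AP is not covered.

```python
from typing import Dict, List, Sequence, Tuple

Mask = List[List[int]]  # binary mask: 1 = target pixel, 0 = background


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def evaluate(
    preds: Sequence[Mask],
    gts: Sequence[Mask],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[str, float]:
    """Compute overall IoU, mean IoU, and Precision@K over paired masks."""
    if len(preds) == 0 or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")

    total_inter = 0
    total_union = 0
    ious = []
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)

    n = len(ious)
    results = {
        # Pools pixels across samples, so large objects weigh more.
        "iou_overall": total_inter / total_union if total_union else 1.0,
        # Averages per-sample IoU, so every sample weighs the same.
        "iou_mean": sum(ious) / n,
    }
    for t in thresholds:
        # Assumption: a sample counts as correct when its IoU exceeds t.
        results[f"precision@{t}"] = sum(1 for iou in ious if iou > t) / n
    return results


if __name__ == "__main__":
    # Two toy 2x2 samples: one perfect prediction, one that covers a quarter of the target.
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    gts = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
    print(evaluate(preds, gts))
```

Because overall IoU pools pixels while mean IoU averages per sample, the two can differ noticeably when object sizes vary, which is one reason the table reports both.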
