Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-Attention Network for Compressed Video Referring Object Segmentation

Weidong Chen, Dexiang Hong, Yuankai Qi, Zhenjun Han, Shuhui Wang, Laiyun Qing, Qingming Huang, Guorong Li

2022-07-26 · Referring Video Object Segmentation · Referring Expression Segmentation · Segmentation · Semantic Segmentation · Video Object Segmentation · Video Semantic Segmentation

Paper · PDF · Code (official)

Abstract

Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation, which increases computation and storage requirements and ultimately slows inference. This may hamper deployment in real-world, resource-limited scenarios such as autonomous cars and drones. To alleviate this problem, in this paper, we explore the referring object segmentation task on compressed videos, i.e., on the original video data flow. Beyond the inherent difficulty of referring video object segmentation itself, obtaining discriminative representations from compressed video is also challenging. To address this problem, we propose a multi-attention network consisting of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representations from compressed data in three modalities: I-frame, Motion Vector, and Residual. The query-based cross-modal Transformer first models the correlation between the linguistic and visual modalities; the fused multi-modal features then guide object queries to generate a content-aware dynamic kernel and to predict the final segmentation masks. Different from previous works, we propose to learn just one kernel, which removes the complicated post-hoc mask-matching procedure of existing methods. Extensive experiments on three challenging datasets show the effectiveness of our method against several state-of-the-art methods designed for RGB data. Source code is available at: https://github.com/DexiangHong/MANet.
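The pipeline the abstract describes (fuse linguistic and visual features via cross-attention, let a single object query produce a content-aware dynamic kernel, then apply that kernel to the features to get mask logits) can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the shapes, the single-head attention, and the dot-product "convolution" with the kernel are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product attention: queries attend over keys/values
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d = 16                                     # shared embedding dim (illustrative)
words = rng.standard_normal((5, d))        # linguistic token features
pixels = rng.standard_normal((8 * 8, d))   # fused visual features (I-frame + MV + residual), 8x8 grid
obj_query = rng.standard_normal((1, d))    # the single learned object query

# 1) fuse linguistic and visual modalities (visual features attend over words)
vis_lang = cross_attention(pixels, words, words) + pixels   # (64, d)

# 2) the object query attends over the fused features -> one dynamic kernel
kernel = cross_attention(obj_query, vis_lang, vis_lang)     # (1, d)

# 3) the kernel is applied to the features (1x1-conv as a dot product) -> mask logits
mask_logits = (vis_lang @ kernel.T).reshape(8, 8)
mask = mask_logits > 0                                      # binary segmentation mask
print(mask_logits.shape)
```

Because only one kernel is predicted, the network emits exactly one mask per expression, which is why no post-hoc matching between candidate masks and the referred object is needed.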

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 56.51 | MANET
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 54.75 | MANET
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 55.63 | MANET
Instance Segmentation | A2D Sentences | AP | 0.471 | MANET
Instance Segmentation | A2D Sentences | IoU mean | 0.632 | MANET
Instance Segmentation | A2D Sentences | IoU overall | 0.726 | MANET
Instance Segmentation | A2D Sentences | Precision@0.5 | 0.734 | MANET
Instance Segmentation | A2D Sentences | Precision@0.6 | 0.682 | MANET
Instance Segmentation | A2D Sentences | Precision@0.7 | 0.579 | MANET
Instance Segmentation | A2D Sentences | Precision@0.8 | 0.389 | MANET
Instance Segmentation | A2D Sentences | Precision@0.9 | 0.132 | MANET
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 56.51 | MANET
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 54.75 | MANET
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 55.63 | MANET
Referring Expression Segmentation | A2D Sentences | AP | 0.471 | MANET
Referring Expression Segmentation | A2D Sentences | IoU mean | 0.632 | MANET
Referring Expression Segmentation | A2D Sentences | IoU overall | 0.726 | MANET
Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.734 | MANET
Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.682 | MANET
Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.579 | MANET
Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.389 | MANET
Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.132 | MANET
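On Refer-YouTube-VOS, the J&F score is the arithmetic mean of region similarity (J, Jaccard index) and contour accuracy (F, boundary F-measure), which is consistent with the three values reported above:

```python
# J&F on Refer-YouTube-VOS is the mean of J (region similarity)
# and F (contour accuracy); values taken from the table above.
J = 54.75
F = 56.51
JF = (J + F) / 2
print(round(JF, 2))  # 55.63, matching the reported J&F
```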

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)