
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Spectrum-guided Multi-granularity Referring Video Object Segmentation

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian

Date: 2023-07-25
Venue: ICCV 2023
Tasks: Referring Video Object Segmentation, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation

Abstract

Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features to segment the decoded high-resolution features. We discovered that this causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation and which degrades their segmentation ability. To address the drift problem, we propose a Spectrum-guided Multi-granularity (SgMg) approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks. In addition, we propose Spectrum-guided Cross-modal Fusion (SCF) to perform intra-frame global interactions in the spectral domain for effective multimodal representation. Finally, we extend SgMg to perform multi-object R-VOS, a new paradigm that enables simultaneous segmentation of multiple referred objects in a video. This makes R-VOS not only faster but also more practical. Extensive experiments show that SgMg achieves state-of-the-art performance on four video benchmark datasets, outperforming the nearest competitor by 2.8 percentage points on Ref-YouTube-VOS. Our extended SgMg enables multi-object R-VOS and runs about 3 times faster while maintaining satisfactory performance. Code is available at https://github.com/bo-miao/SgMg.
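The abstract does not say how SCF is implemented, so the following is only a toy illustration of why working in the spectral domain gives global interactions. It is not the authors' method. It is a minimal 1D sketch, assuming a plain discrete Fourier transform and a hand-picked filter: multiplying two spectra pointwise is equivalent to a circular convolution over the whole signal, so every output position can depend on every input position. The names `spectral_mix` and `circular_conv`, the feature values, and the kernel are all hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence of numbers."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def spectral_mix(features, kernel):
    """Multiply the two spectra pointwise, then transform back (hypothetical helper)."""
    product = [a * b for a, b in zip(dft(features), dft(kernel))]
    # Inputs are real, so the imaginary parts are numerical noise only.
    return [value.real for value in idft(product)]


def circular_conv(features, kernel):
    """Direct circular convolution, used as a reference for comparison."""
    n = len(features)
    return [
        sum(features[(t - s) % n] * kernel[s] for s in range(n))
        for t in range(n)
    ]


# Hypothetical 1D "feature row" and mixing kernel, chosen only for illustration.
features = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.25]

mixed = spectral_mix(features, kernel)
direct = circular_conv(features, kernel)
max_gap = max(abs(a - b) for a, b in zip(mixed, direct))
print("largest difference between spectral and direct result:", max_gap)
```

In a real model the transform would be a 2D FFT over each frame's feature map, and the filter would be learned rather than fixed. How the language features enter that filter in SCF is described in the paper, not here.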

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Refer-YouTube-VOS | F | 67.4 | SgMg |
| Video | Refer-YouTube-VOS | J | 63.9 | SgMg |
| Video | Refer-YouTube-VOS | J&F | 65.7 | SgMg |
| Video | Ref-DAVIS17 | F | 66 | SgMg |
| Video | Ref-DAVIS17 | J | 60.6 | SgMg |
| Video | Ref-DAVIS17 | J&F | 63.3 | SgMg |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 67.4 | SgMg (Pre-training) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 63.9 | SgMg (Pre-training) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 65.7 | SgMg (Pre-training) |
| Instance Segmentation | A2D Sentences | AP | 0.585 | SgMg (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | IoU mean | 0.72 | SgMg (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | IoU overall | 0.799 | SgMg (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.843 | SgMg (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.822 | SgMg (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.767 | SgMg (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.617 | SgMg (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.259 | SgMg (Video-Swin-B) |
| Instance Segmentation | J-HMDB | AP | 0.45 | SgMg (Video-Swin-B) |
| Instance Segmentation | J-HMDB | IoU mean | 0.725 | SgMg (Video-Swin-B) |
| Instance Segmentation | J-HMDB | IoU overall | 0.737 | SgMg (Video-Swin-B) |
| Instance Segmentation | J-HMDB | Precision@0.5 | 0.972 | SgMg (Video-Swin-B) |
| Instance Segmentation | J-HMDB | Precision@0.6 | 0.917 | SgMg (Video-Swin-B) |
| Instance Segmentation | J-HMDB | Precision@0.7 | 0.714 | SgMg (Video-Swin-B) |
| Instance Segmentation | J-HMDB | Precision@0.8 | 0.225 | SgMg (Video-Swin-B) |
| Instance Segmentation | J-HMDB | Precision@0.9 | 0.003 | SgMg (Video-Swin-B) |
| Instance Segmentation | DAVIS 2017 (val) | J&F 1st frame | 63.3 | SgMg |
| Video Object Segmentation | Refer-YouTube-VOS | F | 67.4 | SgMg |
| Video Object Segmentation | Refer-YouTube-VOS | J | 63.9 | SgMg |
| Video Object Segmentation | Refer-YouTube-VOS | J&F | 65.7 | SgMg |
| Video Object Segmentation | Ref-DAVIS17 | F | 66 | SgMg |
| Video Object Segmentation | Ref-DAVIS17 | J | 60.6 | SgMg |
| Video Object Segmentation | Ref-DAVIS17 | J&F | 63.3 | SgMg |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 67.4 | SgMg (Pre-training) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 63.9 | SgMg (Pre-training) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 65.7 | SgMg (Pre-training) |
| Referring Expression Segmentation | A2D Sentences | AP | 0.585 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.72 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.799 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.843 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.822 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.767 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.617 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.259 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | AP | 0.45 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | IoU mean | 0.725 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | IoU overall | 0.737 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | Precision@0.5 | 0.972 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | Precision@0.6 | 0.917 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | Precision@0.7 | 0.714 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | Precision@0.8 | 0.225 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | J-HMDB | Precision@0.9 | 0.003 | SgMg (Video-Swin-B) |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 63.3 | SgMg |
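As a reading aid for the metrics reported above, here is a minimal sketch of how they are commonly defined. J&F is the average of region similarity J and contour accuracy F, which matches the reported values up to rounding. The IoU and Precision@K helpers follow the usual conventions for A2D Sentences and J-HMDB. Two assumptions are made: the threshold comparison is taken as strictly greater than K, and the per-sample pixel counts are invented purely for illustration. The function names are hypothetical and do not come from the SgMg code base.

```python
def j_and_f(j, f):
    """J&F: the mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2.0


def mean_iou(samples):
    """Average of per-sample IoU; samples are (intersection, union) pixel counts."""
    return sum(inter / union for inter, union in samples) / len(samples)


def overall_iou(samples):
    """Total intersection over total union, accumulated across all samples."""
    total_inter = sum(inter for inter, _ in samples)
    total_union = sum(union for _, union in samples)
    return total_inter / total_union


def precision_at(samples, k):
    """Fraction of samples whose IoU exceeds threshold k (assumed strict '>')."""
    hits = sum(1 for inter, union in samples if inter / union > k)
    return hits / len(samples)


# J and F values taken from the results reported for SgMg.
print("Refer-YouTube-VOS J&F:", j_and_f(63.9, 67.4))  # reported as 65.7
print("Ref-DAVIS17 J&F:", j_and_f(60.6, 66.0))        # reported as 63.3

# Invented (intersection, union) pixel counts for three samples.
toy_samples = [(90, 100), (30, 100), (800, 1000)]
print("IoU mean:", mean_iou(toy_samples))
print("IoU overall:", overall_iou(toy_samples))
print("Precision@0.5:", precision_at(toy_samples, 0.5))
```

On the toy data, the large third sample pulls the overall IoU above the mean IoU. This is why the two can differ noticeably on datasets with objects of very different sizes.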

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)