Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang

2023-05-26 · NeurIPS 2023
Tasks: Referring Video Object Segmentation · Cross-Modal Alignment · Referring Expression Segmentation · Segmentation · Semantic Segmentation · Video Object Segmentation · Video Semantic Segmentation
Paper · PDF · Code (official)

Abstract

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations. Code will be available.
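The abstract mentions "multi-modal contrastive supervision" to align video-level object embeddings with language at the video level. The page contains no code, so the following is only a rough, hypothetical sketch of that idea using a standard symmetric InfoNCE-style loss; the function name, shapes, and temperature are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over paired embeddings.

    Matched (video_i, text_i) rows are positives; every other pairing in
    the batch serves as a negative. All names/shapes here are illustrative.
    """
    # L2-normalize so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B) similarity matrix

    # Log-softmax over texts (video->text) and over videos (text->video).
    log_v2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_t2v = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))

    # Positives live on the diagonal; average both directions.
    idx = np.arange(len(v))
    return -(log_v2t[idx, idx].mean() + log_t2v[idx, idx].mean()) / 2
```

Intuitively, the loss is low when each video-level embedding is closer to its own description than to the other descriptions in the batch, which is the "well-aligned joint space" the abstract describes.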

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Video Object Segmentation | Refer-YouTube-VOS | F | 67.9 | SOC |
| Video Object Segmentation | Refer-YouTube-VOS | J | 64.1 | SOC |
| Video Object Segmentation | Refer-YouTube-VOS | J&F | 66.0 | SOC |
| Video Object Segmentation | Ref-DAVIS17 | F | 69.1 | SOC |
| Video Object Segmentation | Ref-DAVIS17 | J | 62.5 | SOC |
| Video Object Segmentation | Ref-DAVIS17 | J&F | 65.8 | SOC |
| Video Object Segmentation | Long-RVOS | J&F | 34.9 | SOC |
| Video Object Segmentation | Long-RVOS | tIoU | 68.1 | SOC |
| Video Object Segmentation | Long-RVOS | vIoU | 28.6 | SOC |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 69.3 | SOC (Joint training, Video-Swin-B) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 65.3 | SOC (Joint training, Video-Swin-B) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 60.5 | SOC (Video-Swin-T) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 57.8 | SOC (Video-Swin-T) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 59.2 | SOC (Video-Swin-T) |
| Instance Segmentation | A2D Sentences | AP | 0.573 | SOC (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | IoU (mean) | 0.725 | SOC (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | IoU (overall) | 0.807 | SOC (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.851 | SOC (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.827 | SOC (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.765 | SOC (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.607 | SOC (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.252 | SOC (Video-Swin-B) |
| Instance Segmentation | A2D Sentences | AP | 0.504 | SOC (Video-Swin-T) |
| Instance Segmentation | A2D Sentences | IoU (mean) | 0.669 | SOC (Video-Swin-T) |
| Instance Segmentation | A2D Sentences | IoU (overall) | 0.747 | SOC (Video-Swin-T) |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.790 | SOC (Video-Swin-T) |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.756 | SOC (Video-Swin-T) |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.687 | SOC (Video-Swin-T) |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.535 | SOC (Video-Swin-T) |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.195 | SOC (Video-Swin-T) |
| Instance Segmentation | J-HMDB | AP | 0.446 | SOC (Video-Swin-B) |
| Instance Segmentation | J-HMDB | IoU (mean) | 0.723 | SOC (Video-Swin-B) |
| Instance Segmentation | J-HMDB | IoU (overall) | 0.736 | SOC (Video-Swin-B) |
| Instance Segmentation | J-HMDB | Precision@0.5 | 0.969 | SOC (Video-Swin-B) |
| Instance Segmentation | J-HMDB | Precision@0.6 | 0.914 | SOC (Video-Swin-B) |
| Instance Segmentation | J-HMDB | Precision@0.7 | 0.711 | SOC (Video-Swin-B) |
| Instance Segmentation | J-HMDB | Precision@0.8 | 0.213 | SOC (Video-Swin-B) |
| Instance Segmentation | J-HMDB | Precision@0.9 | 0.001 | SOC (Video-Swin-B) |
| Instance Segmentation | J-HMDB | AP | 0.397 | SOC (Video-Swin-T) |
| Instance Segmentation | J-HMDB | IoU (mean) | 0.701 | SOC (Video-Swin-T) |
| Instance Segmentation | J-HMDB | IoU (overall) | 0.707 | SOC (Video-Swin-T) |
| Instance Segmentation | J-HMDB | Precision@0.5 | 0.947 | SOC (Video-Swin-T) |
| Instance Segmentation | J-HMDB | Precision@0.6 | 0.864 | SOC (Video-Swin-T) |
| Instance Segmentation | J-HMDB | Precision@0.7 | 0.627 | SOC (Video-Swin-T) |
| Instance Segmentation | J-HMDB | Precision@0.8 | 0.179 | SOC (Video-Swin-T) |
| Instance Segmentation | J-HMDB | Precision@0.9 | 0.001 | SOC (Video-Swin-T) |

The Video Object Segmentation rows are also listed on the source page under a generic "Video" task, and the Instance Segmentation rows are additionally reported under the Referring Expression Segmentation task, with identical values in both cases.
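For the Refer-YouTube-VOS and Ref-DAVIS17 rows, J is region similarity (mask intersection-over-union), F is boundary accuracy, and J&F is their mean under the standard DAVIS-style protocol: the reported 66.0 is exactly (64.1 + 67.9) / 2, and 65.8 is (62.5 + 69.1) / 2. A minimal sketch of J and the J&F average follows; the contour-matching computation behind F is omitted for brevity.

```python
import numpy as np

def region_j(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    # Convention: two empty masks count as a perfect match.
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def j_and_f(j_score, f_score):
    """DAVIS-style overall score: the arithmetic mean of J and F."""
    return (j_score + f_score) / 2
```

The Precision@K columns in the table follow the analogous per-sample idea: a prediction counts as correct when its IoU with the ground-truth mask exceeds the threshold K.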

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
- Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)