Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus

Xiang Li, Jinglu Wang, Xiaohao Xu, Xiao Li, Bhiksha Raj, Yan Lu

2022-07-04
Tasks: Referring Video Object Segmentation, Referring Expression Segmentation, Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation
Paper · PDF · Code (official)

Abstract

Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression. Most existing R-VOS methods make a critical assumption: the object referred to must appear in the video. This assumption, which we refer to as semantic consensus, is often violated in real-world scenarios, where the expression may be queried against false videos. In this work, we highlight the need for a robust R-VOS model that can handle semantic mismatches. Accordingly, we propose an extended task called Robust R-VOS, which accepts unpaired video-text inputs. We tackle this problem by jointly modeling the primary R-VOS problem and its dual problem (text reconstruction). A structural text-to-text cycle constraint is introduced to discriminate semantic consensus between video-text pairs and to impose it on positive pairs, thereby achieving multi-modal alignment from both positive and negative pairs. Our structural constraint effectively addresses the challenge posed by linguistic diversity, overcoming the limitations of previous methods that relied on point-wise constraints. A new evaluation dataset, R²-Youtube-VOS, is constructed to measure model robustness. Our model achieves state-of-the-art performance on the R-VOS benchmarks Ref-DAVIS17 and Ref-Youtube-VOS, as well as on our R²-Youtube-VOS dataset.
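The structural text-to-text cycle constraint can be pictured as follows: rather than forcing each reconstructed expression embedding to match its source expression point-wise, the model compares the *relations* (pairwise similarities) within a batch of original text embeddings against the relations within the reconstructed ones. The sketch below is a minimal, hypothetical illustration of that idea in NumPy, not the paper's actual implementation; the embedding dimensions, the cosine-based relation matrix, and the mean-squared loss are all assumptions made for clarity.

```python
import numpy as np

def relation_matrix(embs):
    """Pairwise cosine similarities within a batch of embeddings (B, D) -> (B, B)."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return normed @ normed.T

def structural_cycle_loss(text_embs, recon_embs):
    """Penalize mismatch between the relational structure of the original
    expressions and that of their reconstructions, instead of forcing each
    reconstruction to match its source point-wise (a hypothetical stand-in
    for the paper's structural constraint)."""
    diff = relation_matrix(text_embs) - relation_matrix(recon_embs)
    return float(np.mean(diff ** 2))

# Toy check: the loss is invariant to batch-wide rescaling (a crude stand-in
# for benign linguistic variation) but reacts to a structurally different batch.
rng = np.random.default_rng(0)
texts = rng.normal(size=(4, 32))
assert structural_cycle_loss(texts, 2.0 * texts) < 1e-12
assert structural_cycle_loss(texts, rng.normal(size=(4, 32))) > 0.0
```

A point-wise constraint would instead compare `text_embs[i]` with `recon_embs[i]` directly, which the abstract argues is brittle under linguistic diversity; the relational form only requires the batch's similarity structure to be preserved.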

Results

Task                             | Dataset                                    | Metric | Value | Model
Video                            | Refer-YouTube-VOS                          | F      | 61.5  | R2VOS (Swin-T)
Video                            | Refer-YouTube-VOS                          | J      | 58.9  | R2VOS (Swin-T)
Video                            | Refer-YouTube-VOS                          | J&F    | 60.2  | R2VOS (Swin-T)
Instance Segmentation            | Refer-YouTube-VOS (2021 public validation) | F      | 63.1  | R2VOS (Video-Swin-T)
Instance Segmentation            | Refer-YouTube-VOS (2021 public validation) | J      | 59.6  | R2VOS (Video-Swin-T)
Instance Segmentation            | Refer-YouTube-VOS (2021 public validation) | J&F    | 61.3  | R2VOS (Video-Swin-T)
Video Object Segmentation        | Refer-YouTube-VOS                          | F      | 61.5  | R2VOS (Swin-T)
Video Object Segmentation        | Refer-YouTube-VOS                          | J      | 58.9  | R2VOS (Swin-T)
Video Object Segmentation        | Refer-YouTube-VOS                          | J&F    | 60.2  | R2VOS (Swin-T)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F     | 63.1  | R2VOS (Video-Swin-T)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J     | 59.6  | R2VOS (Video-Swin-T)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F   | 61.3  | R2VOS (Video-Swin-T)
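For readers unfamiliar with the metrics: J is region similarity (intersection-over-union between predicted and ground-truth masks), F is contour accuracy (a boundary F-measure, whose exact computation involves boundary matching and is omitted here), and J&F is their arithmetic mean. The short sketch below illustrates J and the J&F average; it is a simplified illustration, not the official evaluation code.

```python
import numpy as np

def region_similarity_j(pred_mask, gt_mask):
    """J: intersection-over-union between binary predicted and ground-truth masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: treat as perfect agreement
        return 1.0
    return np.logical_and(pred, gt).sum() / union

def j_and_f(j, f):
    """The headline benchmark score: arithmetic mean of J and F."""
    return (j + f) / 2.0

# Sanity check against the Refer-YouTube-VOS rows above: J=58.9, F=61.5.
print(round(j_and_f(58.9, 61.5), 1))  # 60.2
```

The same averaging reproduces the other J&F entries in the table, e.g. (59.6 + 63.1) / 2 ≈ 61.3 for the 2021 public validation split.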

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
- A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
- SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
- Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
- U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV (2025-07-15)