Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation

Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang

2025-01-23 | Referring Video Object Segmentation | Referring Expression Segmentation | Semantic Segmentation | Video Segmentation | Video Object Segmentation | Video Semantic Segmentation

Abstract

Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a unified multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings, along with multimodal class tokens. A mask prior generator utilizes the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we introduce a hierarchical global-historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal consistency. Extensive experiments on several RVOS benchmarks demonstrate the superiority of MPG-SAM 2 and the effectiveness of our proposed modules.
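The abstract says the mask prior generator combines video embeddings and class tokens into pseudo masks, but it does not give the exact operation. As a rough illustration only, the sketch below assumes one common choice: a sigmoid over the dot product between each pixel embedding and a class token. The function name `pseudo_mask`, the toy shapes, and the similarity rule are all hypothetical and are not taken from the paper or its code.

```python
import math

def pseudo_mask(pixel_embeddings, class_token):
    """Hypothetical mask prior (an assumption, not the paper's design):
    sigmoid of the dot product between each pixel embedding and the class token."""
    mask = []
    for row in pixel_embeddings:
        mask_row = []
        for pixel in row:
            # Similarity between this pixel's embedding and the class token.
            score = sum(p * c for p, c in zip(pixel, class_token))
            # Squash the score into (0, 1) so it reads as a soft mask value.
            mask_row.append(1.0 / (1.0 + math.exp(-score)))
        mask.append(mask_row)
    return mask

# Toy 2x2 frame with 2-dimensional embeddings (made-up numbers).
frame = [[[2.0, 0.0], [-2.0, 0.0]],
         [[0.0, 0.0], [1.0, 1.0]]]
token = [1.0, 0.0]
prior = pseudo_mask(frame, token)
```

In this toy setup, pixels whose embeddings align with the token get values near 1 and opposing pixels get values near 0; a soft map of this kind is the sort of dense prompt the abstract describes feeding to the prompt encoder.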

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Video | MeViS | F | 56.7 | MPG-SAM 2 |
| Video | MeViS | J | 50.7 | MPG-SAM 2 |
| Video | MeViS | J&F | 53.7 | MPG-SAM 2 |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 76.1 | MPG-SAM 2 |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 71.7 | MPG-SAM 2 |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 73.9 | MPG-SAM 2 |
| Video Object Segmentation | MeViS | F | 56.7 | MPG-SAM 2 |
| Video Object Segmentation | MeViS | J | 50.7 | MPG-SAM 2 |
| Video Object Segmentation | MeViS | J&F | 53.7 | MPG-SAM 2 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 76.1 | MPG-SAM 2 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 71.7 | MPG-SAM 2 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 73.9 | MPG-SAM 2 |
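
The page does not define the metrics. In video object segmentation, J conventionally denotes region similarity (the Jaccard index, or intersection over union, between predicted and ground-truth masks), F denotes contour accuracy, and J&F is their mean. The sketch below computes J for binary masks and checks that the reported J&F values match the mean of the reported J and F. The function names are illustrative, and returning 1.0 when both masks are empty is an assumed convention.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks,
    given as equally sized nested lists of 0/1 values."""
    inter = sum(1 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr) if p and g)
    union = sum(1 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr) if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union

def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2

# Toy masks: the prediction covers two pixels, the ground truth one of them.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
toy_j = jaccard(pred, gt)

# Reported values from the results above.
mevis = round(j_and_f(50.7, 56.7), 1)
ref_ytvos = round(j_and_f(71.7, 76.1), 1)
```

Here `toy_j` is 0.5 (one shared pixel out of two covered), and the rounded means reproduce the reported J&F values of 53.7 on MeViS and 73.9 on Refer-YouTube-VOS.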

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV (2025-07-15)