MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation

Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang

2025-01-23

Referring Video Object Segmentation, Referring Expression Segmentation, Semantic Segmentation, Video Segmentation, Video Object Segmentation, Video Semantic Segmentation


Abstract

Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a unified multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings, along with multimodal class tokens. A mask prior generator utilizes the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we introduce a hierarchical global-historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal consistency. Extensive experiments on several RVOS benchmarks demonstrate the superiority of MPG-SAM 2 and the effectiveness of our proposed modules.
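To make the mask-prior idea concrete, here is a minimal sketch of how a class token and per-pixel video embeddings could be turned into a soft pseudo mask. The abstract does not say how the mask prior generator combines them, so the scaled dot product followed by a sigmoid is an assumption chosen for illustration, and the function name `pseudo_mask` is hypothetical rather than taken from the official code.

```python
import math


def pseudo_mask(pixel_embeddings, class_token):
    """Return a soft H x W mask from per-pixel embeddings and one class token.

    ASSUMPTION: similarity is a scaled dot product squashed by a sigmoid.
    The paper's mask prior generator may work differently.

    pixel_embeddings: nested lists of shape H x W x C
    class_token: list of length C
    """
    scale = 1.0 / math.sqrt(len(class_token))
    mask = []
    for row in pixel_embeddings:
        mask_row = []
        for pixel in row:
            # Similarity between this pixel's embedding and the class token.
            score = sum(p * c for p, c in zip(pixel, class_token)) * scale
            # Sigmoid maps the score to a soft foreground probability.
            mask_row.append(1.0 / (1.0 + math.exp(-score)))
        mask.append(mask_row)
    return mask


# Toy 1 x 3 frame with 2-dimensional embeddings (made-up values).
frame = [[[2.0, 0.0], [-2.0, 0.0], [0.0, 0.0]]]
token = [1.0, 0.0]
soft_mask = pseudo_mask(frame, token)
```

In this toy input, the pixel aligned with the token scores above 0.5, the opposed pixel scores below 0.5, and the zero embedding lands at exactly 0.5. A mask of this kind is what the abstract describes being passed to the prompt encoder as a dense prompt.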

Results

Task	Dataset	Metric	Value	Model
Video	MeViS	F	56.7	MPG-SAM 2
Video	MeViS	J	50.7	MPG-SAM 2
Video	MeViS	J&F	53.7	MPG-SAM 2
Instance Segmentation	Refer-YouTube-VOS (2021 public validation)	F	76.1	MPG-SAM 2
Instance Segmentation	Refer-YouTube-VOS (2021 public validation)	J	71.7	MPG-SAM 2
Instance Segmentation	Refer-YouTube-VOS (2021 public validation)	J&F	73.9	MPG-SAM 2
Video Object Segmentation	MeViS	F	56.7	MPG-SAM 2
Video Object Segmentation	MeViS	J	50.7	MPG-SAM 2
Video Object Segmentation	MeViS	J&F	53.7	MPG-SAM 2
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	F	76.1	MPG-SAM 2
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	J	71.7	MPG-SAM 2
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	J&F	73.9	MPG-SAM 2
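
For readers unfamiliar with the metrics, J is region similarity (mask IoU), F is contour accuracy, and J&F is their arithmetic mean. The short check below recomputes J&F from the J and F values in the table. The helper name `j_and_f` is introduced here for illustration only.

```python
def j_and_f(j, f):
    """Mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2


# (J, F, reported J&F) copied from the results table above.
reported = {
    "MeViS": (50.7, 56.7, 53.7),
    "Refer-YouTube-VOS (2021 public validation)": (71.7, 76.1, 73.9),
}

for dataset, (j, f, jf) in reported.items():
    computed = round(j_and_f(j, f), 1)
    print(f"{dataset}: computed J&F = {computed}, reported J&F = {jf}")
```

Both recomputed values agree with the reported J&F to one decimal place.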
