Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

Suhwan Cho, Seunghoon Lee, Minhyeok Lee, Jungho Lee, Sangyoun Lee

2025-03-05
Referring Video Object Segmentation, Segmentation, Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation


Abstract

Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, a novel decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. We demonstrate that FindTrack outperforms existing methods on public benchmarks.
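The key-frame step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual scoring rule, so everything here is an assumption: per-frame scores are taken as already computed and normalized to [0, 1], the two cues are balanced with a simple convex combination, and the names `select_key_frame`, `propagation_order`, and `weight` are hypothetical. The outward visiting order from the key frame is likewise only one plausible way a propagation module could traverse the video.

```python
from typing import List, Sequence


def select_key_frame(seg_conf: Sequence[float],
                     text_align: Sequence[float],
                     weight: float = 0.5) -> int:
    """Return the index of the frame with the highest combined score.

    Assumption: segmentation confidence and vision-text alignment are
    balanced with a convex combination; the source does not state the rule.
    """
    if len(seg_conf) != len(text_align) or len(seg_conf) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    scores = [weight * c + (1.0 - weight) * a
              for c, a in zip(seg_conf, text_align)]
    # Index of the maximum combined score (first one wins on ties).
    return max(range(len(scores)), key=scores.__getitem__)


def propagation_order(num_frames: int, key: int) -> List[int]:
    """Hypothetical visiting order: forward from the key frame, then backward."""
    forward = list(range(key + 1, num_frames))
    backward = list(range(key - 1, -1, -1))
    return forward + backward


# Toy scores for a three-frame clip (made-up numbers, not from the paper).
confidence = [0.9, 0.6, 0.8]
alignment = [0.2, 0.7, 0.9]
key = select_key_frame(confidence, alignment)   # 2
order = propagation_order(len(confidence), key)  # [1, 0]
```

In this toy clip, frame 0 has the most confident mask but aligns poorly with the text, so the combined score prefers frame 2. Setting `weight=1.0` would ignore the text alignment and pick frame 0 instead, which is the kind of ambiguous identification the decoupled design is meant to avoid.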

Results

Task	Dataset	J&F	J	F	Model
Video Object Segmentation	MeViS	53.2	50.5	55.9	FindTrack
Video Object Segmentation	Refer-YouTube-VOS	73.7	71.8	75.7	FindTrack
Video Object Segmentation	Ref-DAVIS17	74.2	69.9	78.5	FindTrack
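
For readers unfamiliar with the metrics: in video object segmentation benchmarks, J conventionally denotes region similarity (mask intersection-over-union), F denotes contour accuracy (a boundary F-measure), and J&F is their mean. The short check below assumes the table follows that convention and recomputes J&F from the listed J and F values. The helper name `j_and_f` is hypothetical.

```python
def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# (J, F, reported J&F) copied from the results table above.
reported = {
    "MeViS": (50.5, 55.9, 53.2),
    "Refer-YouTube-VOS": (71.8, 75.7, 73.7),
    "Ref-DAVIS17": (69.9, 78.5, 74.2),
}

for dataset, (j, f, jf) in reported.items():
    recomputed = j_and_f(j, f)
    print(f"{dataset}: recomputed {recomputed:.2f}, reported {jf}")
```

The recomputed values agree with the reported ones to within 0.05. The small gap on Refer-YouTube-VOS is consistent with J and F having been rounded before listing, though the source does not say so.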
