Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

Tianming Liang, Kun-Yu Lin, Chaolei Tan, JianGuo Zhang, Wei-Shi Zheng, Jian-Fang Hu

2025-01-24 · Visual Grounding · Referring Video Object Segmentation · Referring Expression Segmentation · Semantic Segmentation · Video Object Segmentation · Video Semantic Segmentation

Abstract

Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. Despite notable progress in recent years, current RVOS models still struggle to handle complicated object descriptions due to their limited video-language understanding. To address this limitation, we present ReferDINO, an end-to-end RVOS model that inherits strong vision-language understanding from pretrained visual grounding foundation models, and is further endowed with effective temporal understanding and object segmentation capabilities. In ReferDINO, we contribute three technical innovations for effectively adapting the foundation models to RVOS: 1) an object-consistent temporal enhancer that capitalizes on the pretrained object-text representations to enhance temporal understanding and object consistency; 2) a grounding-guided deformable mask decoder that integrates text and grounding conditions to generate accurate object masks; 3) a confidence-aware query pruning strategy that significantly improves object decoding efficiency without compromising performance. We conduct extensive experiments on five public RVOS benchmarks and demonstrate that our proposed ReferDINO significantly outperforms state-of-the-art methods. Project page: https://isee-laboratory.github.io/ReferDINO
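The confidence-aware query pruning mentioned in point 3 can be illustrated with a minimal sketch: keep only the top-scoring fraction of object queries before the (expensive) decoding stage. This is a hedged illustration, not the paper's implementation; the function, parameter names, and the fixed keep ratio are assumptions, and the paper's exact scoring and schedule may differ.

```python
import numpy as np

def prune_queries(queries, confidences, keep_ratio=0.5):
    """Keep the top-scoring fraction of object queries.

    queries:     (N, D) array of object query embeddings
    confidences: (N,) array of per-query confidence scores
    keep_ratio:  fraction of queries to retain (illustrative default)
    """
    k = max(1, int(len(confidences) * keep_ratio))
    # Indices of the k highest-confidence queries, best first
    keep = np.argsort(confidences)[::-1][:k]
    return queries[keep], confidences[keep]
```

Pruning before decoding shrinks the per-frame workload roughly in proportion to the keep ratio, which is where the claimed efficiency gain would come from.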

Results

Task | Dataset | Metric | Value | Model
Video | MeViS | F | 53.9 | ReferDINO (Swin-B)
Video | MeViS | J | 44.7 | ReferDINO (Swin-B)
Video | MeViS | J&F | 49.3 | ReferDINO (Swin-B)
Video | Long-RVOS | J&F | 48.7 | ReferDINO
Video | Long-RVOS | tIoU | 71.7 | ReferDINO
Video | Long-RVOS | vIoU | 41.2 | ReferDINO
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 71.5 | ReferDINO (Swin-B)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 67 | ReferDINO (Swin-B)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 69.3 | ReferDINO (Swin-B)
Video Object Segmentation | MeViS | F | 53.9 | ReferDINO (Swin-B)
Video Object Segmentation | MeViS | J | 44.7 | ReferDINO (Swin-B)
Video Object Segmentation | MeViS | J&F | 49.3 | ReferDINO (Swin-B)
Video Object Segmentation | Long-RVOS | J&F | 48.7 | ReferDINO
Video Object Segmentation | Long-RVOS | tIoU | 71.7 | ReferDINO
Video Object Segmentation | Long-RVOS | vIoU | 41.2 | ReferDINO
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 71.5 | ReferDINO (Swin-B)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 67 | ReferDINO (Swin-B)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 69.3 | ReferDINO (Swin-B)
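The J and F metrics in the table follow the standard DAVIS-style convention for video object segmentation: J is region similarity (mask IoU), F is contour accuracy, and J&F is their arithmetic mean. A minimal sketch of J and the J&F average, assuming binary NumPy masks (helper names are mine, not from the benchmarks):

```python
import numpy as np

def region_similarity(pred_mask, gt_mask):
    """J: intersection-over-union of two binary segmentation masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, gt).sum() / union

def j_and_f(j_score, f_score):
    """J&F: the arithmetic mean of region similarity and contour accuracy."""
    return (j_score + f_score) / 2.0
```

For example, the table's MeViS row (J = 44.7, F = 53.9) averages to the reported J&F of 49.3. The Long-RVOS tIoU and vIoU metrics are benchmark-specific temporal and volumetric IoU variants and are not sketched here.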

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition (2025-07-15)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)