Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

Tianming Liang, Kun-Yu Lin, Chaolei Tan, JianGuo Zhang, Wei-Shi Zheng, Jian-Fang Hu

2025-01-24 · Visual Grounding · Referring Video Object Segmentation · Referring Expression Segmentation · Semantic Segmentation · Video Object Segmentation · Video Semantic Segmentation

Abstract

Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. Despite notable progress in recent years, current RVOS models still struggle to handle complicated object descriptions due to their limited video-language understanding. To address this limitation, we present ReferDINO, an end-to-end RVOS model that inherits strong vision-language understanding from pretrained visual grounding foundation models, and is further endowed with effective temporal understanding and object segmentation capabilities. In ReferDINO, we contribute three technical innovations for effectively adapting the foundation models to RVOS: 1) an object-consistent temporal enhancer that capitalizes on the pretrained object-text representations to enhance temporal understanding and object consistency; 2) a grounding-guided deformable mask decoder that integrates text and grounding conditions to generate accurate object masks; 3) a confidence-aware query pruning strategy that significantly improves object decoding efficiency without compromising performance. We conduct extensive experiments on five public RVOS benchmarks and demonstrate that our proposed ReferDINO significantly outperforms state-of-the-art methods. Project page: https://isee-laboratory.github.io/ReferDINO
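The confidence-aware query pruning mentioned in point 3 can be illustrated with a minimal sketch: keep only the top-scoring fraction of object queries before the (expensive) decoding stage. This is a hedged illustration, not the paper's implementation; the function, parameter names, and the fixed keep ratio are assumptions, and the paper's exact scoring and schedule may differ.

```python
import numpy as np

def prune_queries(queries, confidences, keep_ratio=0.5):
    """Keep the top-scoring fraction of object queries.

    queries:     (N, D) array of object query embeddings
    confidences: (N,) array of per-query confidence scores
    keep_ratio:  fraction of queries to retain (illustrative default)
    """
    k = max(1, int(len(confidences) * keep_ratio))
    # Indices of the k highest-confidence queries, best first
    keep = np.argsort(confidences)[::-1][:k]
    return queries[keep], confidences[keep]
```

Pruning before decoding shrinks the per-frame workload roughly in proportion to the keep ratio, which is where the claimed efficiency gain would come from.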

Results

Task | Dataset | Metric | Value | Model
Video | MeViS | F | 53.9 | ReferDINO (Swin-B)
Video | MeViS | J | 44.7 | ReferDINO (Swin-B)
Video | MeViS | J&F | 49.3 | ReferDINO (Swin-B)
Video | Long-RVOS | J&F | 48.7 | ReferDINO
Video | Long-RVOS | tIoU | 71.7 | ReferDINO
Video | Long-RVOS | vIoU | 41.2 | ReferDINO
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 71.5 | ReferDINO (Swin-B)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 67 | ReferDINO (Swin-B)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 69.3 | ReferDINO (Swin-B)
Video Object Segmentation | MeViS | F | 53.9 | ReferDINO (Swin-B)
Video Object Segmentation | MeViS | J | 44.7 | ReferDINO (Swin-B)
Video Object Segmentation | MeViS | J&F | 49.3 | ReferDINO (Swin-B)
Video Object Segmentation | Long-RVOS | J&F | 48.7 | ReferDINO
Video Object Segmentation | Long-RVOS | tIoU | 71.7 | ReferDINO
Video Object Segmentation | Long-RVOS | vIoU | 41.2 | ReferDINO
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 71.5 | ReferDINO (Swin-B)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 67 | ReferDINO (Swin-B)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 69.3 | ReferDINO (Swin-B)
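The J and F metrics in the table follow the standard DAVIS-style convention for video object segmentation: J is region similarity (mask IoU), F is contour accuracy, and J&F is their arithmetic mean. A minimal sketch of J and the J&F average, assuming binary NumPy masks (helper names are mine, not from the benchmarks):

```python
import numpy as np

def region_similarity(pred_mask, gt_mask):
    """J: intersection-over-union of two binary segmentation masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, gt).sum() / union

def j_and_f(j_score, f_score):
    """J&F: the arithmetic mean of region similarity and contour accuracy."""
    return (j_score + f_score) / 2.0
```

For example, the table's MeViS row (J = 44.7, F = 53.9) averages to the reported J&F of 49.3. The Long-RVOS tIoU and vIoU metrics are benchmark-specific temporal and volumetric IoU variants and are not sketched here.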

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition (2025-07-15)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)