Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Object Segmentation with Language Referring Expressions

Anna Khoreva, Anna Rohrbach, Bernt Schiele

2018-03-21
Semi-Supervised Video Object Segmentation, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation

Abstract

Most state-of-the-art semi-supervised video object segmentation methods rely on a pixel-accurate mask of a target object provided for the first frame of a video. However, obtaining a detailed segmentation mask is expensive and time-consuming. In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions. Besides being a more practical and natural way of pointing out a target object, using language specifications can help to avoid drift as well as make the system more robust to complex dynamics and appearance variations. Leveraging recent advances in language grounding models designed for images, we propose an approach to extend them to video data, ensuring temporally coherent predictions. To evaluate our method we augment the popular video object segmentation benchmarks DAVIS'16 and DAVIS'17 with language descriptions of target objects. We show that our language-supervised approach performs on par with methods that have access to a pixel-level mask of the target object on DAVIS'16 and is competitive with methods using scribbles on the challenging DAVIS'17 dataset.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | DAVIS 2017 | J&F | 62.2 | VOSwL (Mask+Language) |
| Video | DAVIS 2017 | mIoU | 59 | VOSwL (Mask+Language) |
| Video | DAVIS 2016 | mIoU | 84.5 | VOSwL (Mask+Language) |
| Video | DAVIS 2016 | mIoU | 82.8 | VOSwL (Language) |
| Video | DAVIS 2017 (val) | J&F | 60.8 | VOSwL (Language) |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 58 | VOSwL (Language) |
| Video | DAVIS 2017 (val) | F-measure (Decay) | 24.5 | VOSwL |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 63.5 | VOSwL |
| Video | DAVIS 2017 (val) | F-measure (Recall) | 70.4 | VOSwL |
| Video | DAVIS 2017 (val) | Jaccard (Decay) | 22.4 | VOSwL |
| Video | DAVIS 2017 (val) | Jaccard (Recall) | 66.1 | VOSwL |
| Video | DAVIS 2016 | F-measure (Decay) | 8.6 | VOSwL |
| Video | DAVIS 2016 | F-measure (Mean) | 84.2 | VOSwL |
| Video | DAVIS 2016 | F-measure (Recall) | 93.9 | VOSwL |
| Video | DAVIS 2016 | J&F | 83.65 | VOSwL |
| Video | DAVIS 2016 | Jaccard (Decay) | 6.9 | VOSwL |
| Video | DAVIS 2016 | Jaccard (Mean) | 83.1 | VOSwL |
| Video | DAVIS 2016 | Jaccard (Recall) | 95.7 | VOSwL |
| Instance Segmentation | DAVIS 2017 (val) | J&F 1st frame | 39.3 | Khoreva et al. |
| Instance Segmentation | DAVIS 2017 (val) | J&F Full video | 37.1 | Khoreva et al. |
| Video Object Segmentation | DAVIS 2017 | J&F | 62.2 | VOSwL (Mask+Language) |
| Video Object Segmentation | DAVIS 2017 | mIoU | 59 | VOSwL (Mask+Language) |
| Video Object Segmentation | DAVIS 2016 | mIoU | 84.5 | VOSwL (Mask+Language) |
| Video Object Segmentation | DAVIS 2016 | mIoU | 82.8 | VOSwL (Language) |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 60.8 | VOSwL (Language) |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 58 | VOSwL (Language) |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Decay) | 24.5 | VOSwL |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 63.5 | VOSwL |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Recall) | 70.4 | VOSwL |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Decay) | 22.4 | VOSwL |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Recall) | 66.1 | VOSwL |
| Video Object Segmentation | DAVIS 2016 | F-measure (Decay) | 8.6 | VOSwL |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 84.2 | VOSwL |
| Video Object Segmentation | DAVIS 2016 | F-measure (Recall) | 93.9 | VOSwL |
| Video Object Segmentation | DAVIS 2016 | J&F | 83.65 | VOSwL |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Decay) | 6.9 | VOSwL |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 83.1 | VOSwL |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Recall) | 95.7 | VOSwL |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 39.3 | Khoreva et al. |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F Full video | 37.1 | Khoreva et al. |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 60.8 | VOSwL (Language) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 58 | VOSwL (Language) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Decay) | 24.5 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 63.5 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Recall) | 70.4 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Decay) | 22.4 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Recall) | 66.1 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Decay) | 8.6 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 84.2 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Recall) | 93.9 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 83.65 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Decay) | 6.9 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 83.1 | VOSwL |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Recall) | 95.7 | VOSwL |
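
The J&F values listed above appear consistent with the usual DAVIS convention, in which J&F is the average of the mean Jaccard index (region similarity, J) and the mean F-measure (contour accuracy, F). The sketch below is illustrative only and is not taken from the paper or from any evaluation toolkit: the function names `jaccard` and `j_and_f` are hypothetical, masks are assumed to be nested lists of 0/1 values, and treating two empty masks as a perfect match is an assumed convention.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks.

    Masks are nested lists of 0/1 values with identical shapes (an assumption
    made here for a dependency-free example).
    """
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_mean, f_mean):
    """Combined score: arithmetic mean of mean J and mean F."""
    return (j_mean + f_mean) / 2.0


# Mean J and mean F values copied from the results listed above.
davis16 = j_and_f(83.1, 84.2)      # listed J&F for DAVIS 2016: 83.65
davis17_val = j_and_f(58.0, 63.5)  # listed J&F for DAVIS 2017 (val): 60.8

# Toy masks: one overlapping pixel out of two labelled pixels in total.
toy_iou = jaccard([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Under this assumption the DAVIS 2016 average reproduces the listed 83.65 up to floating-point error, and the DAVIS 2017 (val) average of 60.75 agrees with the listed 60.8 after rounding to one decimal place.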

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)