Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation

Linfeng Yuan, Miaojing Shi, Zijie Yue, Qijun Chen

Published: 2023-06-14 · CVPR 2024
Tasks: Referring Video Object Segmentation, Referring Expression Segmentation, Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation
Links: Paper · PDF · Code (official)

Abstract

Referring video object segmentation (RVOS) aims to segment the target instance referred to by a given text expression in a video clip. The text expression normally contains a sophisticated description of the instance's appearance, action, and relation to others. It is therefore rather difficult for an RVOS model to capture all of these attributes in the video; in practice, the model often favours the action- and relation-related visual attributes of the instance. This can result in partial or even incorrect mask predictions for the target instance. We tackle this problem by extracting a subject-centric short text expression from the original long text expression. The short expression retains only the appearance-related information of the target instance, so we can use it to focus the model's attention on the instance's appearance. We let the model make joint predictions using both the long and short text expressions; we insert a long-short cross-attention module to let the joint features interact and a long-short predictions intersection loss to regulate the joint predictions. Beyond the improvement on the linguistic side, we also introduce a forward-backward visual consistency loss, which utilizes optical flow to warp visual features between the annotated frames and their temporal neighbours and enforces consistency between them. We build our method on top of two state-of-the-art pipelines. Extensive experiments on A2D-Sentences, Refer-YouTube-VOS, JHMDB-Sentences, and Refer-DAVIS17 show impressive improvements from our method. Code is available at https://github.com/LinfengYuan1997/Losh.
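The abstract's "long-short predictions intersection loss" encourages the masks predicted from the long and short expressions to agree on the same target. The exact formulation is in the paper; the sketch below is only an illustrative soft-IoU-style agreement penalty (the function name, signature, and loss form are assumptions, not the authors' code):

```python
import numpy as np

def long_short_intersection_loss(mask_long, mask_short, eps=1e-6):
    """Illustrative agreement penalty between two soft masks in [0, 1].

    mask_long / mask_short: arrays of shape (batch, H, W) holding the mask
    probabilities predicted from the long and short text expressions.
    Returns 1 - soft-IoU averaged over the batch: 0 when the two
    predictions coincide, approaching 1 as they become disjoint.
    NOTE: a hypothetical sketch, not the loss used in the LoSh paper.
    """
    inter = (mask_long * mask_short).sum(axis=(-2, -1))
    union = (mask_long + mask_short - mask_long * mask_short).sum(axis=(-2, -1))
    return float((1.0 - inter / (union + eps)).mean())
```

For identical masks the penalty is (numerically) zero, while two disjoint masks give a penalty of 1, so minimising it pulls the appearance-focused short-expression prediction and the full long-expression prediction toward the same region.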

Results

Task | Dataset | Metric | Value | Model
Video Object Segmentation | Ref-DAVIS17 | F | 66.8 | LoSh
Video Object Segmentation | Ref-DAVIS17 | J | 61.8 | LoSh
Video Object Segmentation | Ref-DAVIS17 | J&F | 64.3 | LoSh
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 66.0 | LoSh-R
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 62.5 | LoSh-R
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 64.2 | LoSh-R
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 66.0 | LoSh-R
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 62.5 | LoSh-R
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 64.2 | LoSh-R

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV (2025-07-15)