
Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation

Guang Feng, Lihe Zhang, Zhiwei Hu, Huchuan Lu

2022-03-30 · Referring Expression Segmentation · Video Segmentation · Video Semantic Segmentation · Vocal Bursts Valence Prediction

Paper · PDF

Abstract

Referring video segmentation aims to segment the video object described by a language expression. To address this task, we first design a two-stream encoder that hierarchically extracts CNN-based visual features and transformer-based linguistic features, with a vision-language mutual guidance (VLMG) module inserted into the encoder multiple times to promote hierarchical and progressive fusion of the multi-modal features. Compared with existing multi-modal fusion methods, this two-stream encoder takes multi-granularity linguistic context into account and, with the help of VLMG, realizes deep interleaving between the modalities. To promote temporal alignment between frames, we further propose a language-guided multi-scale dynamic filtering (LMDF) module that strengthens temporal coherence: it uses language-guided spatio-temporal features to generate a set of position-specific dynamic filters that update the features of the current frame more flexibly and effectively. Extensive experiments on four datasets verify the effectiveness of the proposed model.
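
The abstract describes two mechanisms: VLMG blocks interleaved between the stages of a CNN visual stream and a transformer linguistic stream, and an LMDF module that predicts position-specific dynamic filters from language-guided features. The sketch below is a minimal single-scale PyTorch illustration of both ideas; all class names, shapes, and hyperparameters are assumptions made for this example, and it simplifies the paper's multi-scale design rather than reproducing the authors' implementation.

```python
# Minimal single-scale sketch of the two ideas above (not the authors' code):
# VLMG-style mutual cross-attention between streams, and LMDF-style
# position-specific dynamic filtering. Shapes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VLMG(nn.Module):
    """Vision-language mutual guidance: each stream attends to the other,
    so fusion happens inside the encoder rather than after it."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.v_from_l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: (B, HW, C) flattened visual tokens; lang: (B, L, C) word tokens
        vis = vis + self.v_from_l(vis, lang, lang)[0]    # language guides vision
        lang = lang + self.l_from_v(lang, vis, vis)[0]   # vision guides language
        return vis, lang


class InterleavedEncoder(nn.Module):
    """Two-stream encoder with VLMG inserted after every stage, so the
    fusion is hierarchical and progressive rather than one-shot."""

    def __init__(self, dim=256, depth=3):
        super().__init__()
        self.cnn = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=1) for _ in range(depth))  # stand-in for CNN stages
        self.txt = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True) for _ in range(depth))
        self.fuse = nn.ModuleList(VLMG(dim) for _ in range(depth))

    def forward(self, frame, words):
        # frame: (B, C, H, W) visual features; words: (B, L, C) word embeddings
        for cnn, txt, fuse in zip(self.cnn, self.txt, self.fuse):
            frame, words = torch.relu(cnn(frame)), txt(words)
            b, c, h, w = frame.shape
            vis, words = fuse(frame.flatten(2).transpose(1, 2), words)
            frame = vis.transpose(1, 2).view(b, c, h, w)
        return frame, words


class LMDF(nn.Module):
    """Language-guided dynamic filtering: predict a k*k filter per spatial
    position from a language-conditioned context feature and apply it to
    the current frame, giving a content-adaptive temporal update."""

    def __init__(self, dim, k=3):
        super().__init__()
        self.k = k
        self.pred = nn.Conv2d(dim, dim * k * k, 1)  # one filter per position

    def forward(self, cur, ctx):
        # cur: (B, C, H, W) current frame; ctx: language-guided spatio-temporal feature
        b, c, h, w = cur.shape
        filt = self.pred(ctx).view(b, c, self.k * self.k, h, w).softmax(dim=2)
        patches = F.unfold(cur, self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h, w)
        return (patches * filt).sum(dim=2)                     # filtered frame feature
```

In use, `ctx` would be a language-conditioned aggregate of neighboring frames; the paper applies this filtering at multiple scales and with a more elaborate encoder, which the sketch omits for brevity.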

Results

(The same numbers are reported under both the Instance Segmentation and Referring Expression Segmentation tasks; they are listed once below.)

Dataset | Metric | Value | Model
Refer-YouTube-VOS (2021 public validation) | F | 50.67 | VLIDE
Refer-YouTube-VOS (2021 public validation) | J | 48.44 | VLIDE
Refer-YouTube-VOS (2021 public validation) | J&F | 49.56 | VLIDE
A2D Sentences | AP | 0.469 | VLIDE
A2D Sentences | IoU mean | 0.598 | VLIDE
A2D Sentences | IoU overall | 0.714 | VLIDE
A2D Sentences | Precision@0.5 | 0.702 | VLIDE
A2D Sentences | Precision@0.6 | 0.663 | VLIDE
A2D Sentences | Precision@0.7 | 0.585 | VLIDE
A2D Sentences | Precision@0.8 | 0.428 | VLIDE
A2D Sentences | Precision@0.9 | 0.151 | VLIDE
J-HMDB | AP | 0.441 | VLIDE
J-HMDB | IoU mean | 0.666 | VLIDE
J-HMDB | IoU overall | 0.68 | VLIDE
J-HMDB | Precision@0.5 | 0.874 | VLIDE
J-HMDB | Precision@0.6 | 0.791 | VLIDE
J-HMDB | Precision@0.7 | 0.586 | VLIDE
J-HMDB | Precision@0.8 | 0.182 | VLIDE
J-HMDB | Precision@0.9 | 0.3 | VLIDE
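
For reference, the table's metrics are standard mask-overlap measures: J is region IoU and F is boundary accuracy on Refer-YouTube-VOS, while A2D Sentences and J-HMDB report AP (average precision over a range of IoU thresholds), mean and overall IoU, and Precision@K (the fraction of samples whose IoU exceeds K). Below is a minimal sketch of the IoU-based ones, assuming binary NumPy masks; it mirrors the common evaluation protocol, not the authors' exact script.

```python
# Sketch of the IoU-based metrics in the table, assuming binary NumPy masks.
import numpy as np

def iou(pred, gt):
    # Intersection-over-union of two boolean masks.
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    ious = np.array([iou(p, g) for p, g in zip(preds, gts)])
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    scores = {"IoU mean": ious.mean(),       # average per-sample IoU
              "IoU overall": inter / union}  # pixel-pooled over the dataset
    scores.update({f"Precision@{t}": (ious > t).mean() for t in thresholds})
    return scores
```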

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation (2025-07-13)
MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation (2025-07-10)
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy (2025-07-02)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025-06-28)
CogGen: A Learner-Centered Generative AI Architecture for Intelligent Tutoring with Programming Video (2025-06-25)
Leader360V: The Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment (2025-06-17)