Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Local-Global Context Aware Transformer for Language-Guided Video Segmentation

Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo, Yi Yang

2022-03-18

Tasks: Visual Grounding · Referring Video Object Segmentation · Referring Expression Segmentation · Segmentation · Semantic Segmentation · Video Segmentation · Video Object Segmentation · Video Semantic Segmentation

Links: Paper · PDF · Code (official)

Abstract

We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representation, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components -- one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant-size memory, whereas Transformer-style self-attention scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that Locater outperforms previous state-of-the-art methods. Further, we won 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation for the winning solution. Our code and dataset are available at: https://github.com/leonnnop/Locater
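The abstract's central idea — a fixed-size memory (one global, one local component) from which an adaptive per-frame query is built, keeping per-frame cost constant — can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: all class and variable names (`FiniteMemorySegmenter`, the running-mean global memory, the 0.5 mixing weights) are assumptions chosen for clarity, and real frame/language features would come from learned encoders.

```python
import numpy as np
from collections import deque

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class FiniteMemorySegmenter:
    """Toy sketch of the finite-memory idea: a fixed-size global memory
    (running summary of the whole video) plus a small local memory
    (recent frames), so per-frame cost stays constant regardless of
    video length, unlike full self-attention over all frames."""

    def __init__(self, dim, local_size=3):
        self.global_mem = np.zeros(dim)            # persistent global context
        self.local_mem = deque(maxlen=local_size)  # recent temporal context
        self.n_frames = 0

    def step(self, frame_feats, lang_emb):
        # frame_feats: (num_pixels, dim) features of the current frame
        # lang_emb:    (dim,) embedding of the language expression
        frame_summary = frame_feats.mean(axis=0)
        # Update global memory as a running mean over all frames seen so far.
        self.n_frames += 1
        self.global_mem += (frame_summary - self.global_mem) / self.n_frames
        # Local context: mean of the last few frame summaries (if any).
        local_ctx = (np.mean(self.local_mem, axis=0)
                     if self.local_mem else np.zeros_like(lang_emb))
        # Adaptive query: the expression conditioned on memorized context
        # (mixing weights here are arbitrary illustrative constants).
        query = lang_emb + 0.5 * self.global_mem + 0.5 * local_ctx
        # Cross-attend the query to the frame's pixels for a soft mask.
        scores = frame_feats @ query / np.sqrt(len(query))
        mask = softmax(scores)
        self.local_mem.append(frame_summary)
        return mask
```

Because each step touches only the current frame and two fixed-size memory slots, processing T frames is O(T) in time with O(1) memory, matching the complexity claim in the abstract.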

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 51.1 | Locater |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 48.8 | Locater |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 50 | Locater |
| Instance Segmentation | A2D Sentences | AP | 0.465 | Locater |
| Instance Segmentation | A2D Sentences | IoU mean | 0.597 | Locater |
| Instance Segmentation | A2D Sentences | IoU overall | 0.69 | Locater |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.709 | Locater |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.64 | Locater |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.525 | Locater |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.351 | Locater |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.101 | Locater |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 51.1 | Locater |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 48.8 | Locater |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 50 | Locater |
| Referring Expression Segmentation | A2D Sentences | AP | 0.465 | Locater |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.597 | Locater |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.69 | Locater |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.709 | Locater |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.64 | Locater |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.525 | Locater |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.351 | Locater |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.101 | Locater |
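The rounded J&F entry above is consistent with the other two Refer-YouTube-VOS rows, assuming the standard video object segmentation convention that J&F is the arithmetic mean of region similarity (J) and contour accuracy (F):

```python
# Consistency check for the table, assuming J&F = mean(J, F)
# as in standard VOS benchmark evaluation.
j_region = 48.8   # J on Refer-YouTube-VOS (2021 public validation)
f_contour = 51.1  # F on the same split
j_and_f = (j_region + f_contour) / 2
print(j_and_f)    # averages to 49.95; the table reports it rounded to 50
```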

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
- Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
- A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)