VLT: Vision-Language Transformer and Query Generation for Referring Segmentation

Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang

2022-10-28 · Referring Video Object Segmentation · Referring Expression Segmentation · Video Object Segmentation

Paper · PDF · Code (official)

Abstract

We propose a Vision-Language Transformer (VLT) framework for referring segmentation that facilitates deep interaction among multi-modal information and enhances the holistic understanding of vision-language features. The dynamic emphasis of a language expression can be understood in different ways, especially when it interacts with the image. However, the learned queries in existing transformer works are fixed after training and therefore cannot cope with the randomness and huge diversity of language expressions. To address this issue, we propose a Query Generation Module, which dynamically produces multiple sets of input-specific queries to represent the diverse comprehensions of the language expression. To find the best among these diverse comprehensions, and thus generate a better mask, we propose a Query Balance Module to selectively fuse the corresponding responses of the set of queries. Furthermore, to strengthen the model's ability to deal with diverse language expressions, we consider inter-sample learning to explicitly endow the model with knowledge of different language expressions that refer to the same object. We introduce masked contrastive learning to pull together the features of different expressions for the same target object while distinguishing the features of different objects. The proposed approach is lightweight and consistently achieves new state-of-the-art referring segmentation results on five datasets.
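As a rough illustration of the two modules described in the abstract, the sketch below (PyTorch; not the authors' implementation) shows one hypothetical way a query-generation step could derive input-specific queries from vision and language features, and how a query-balance step could weight and fuse the per-query responses. All class names, tensor shapes, and layer choices here are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class QueryGenerationModule(nn.Module):
    """Hypothetical sketch: derive input-specific queries from vision + language features."""
    def __init__(self, dim: int = 256, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        # learned seeds; the emitted queries still vary per input because they are
        # re-computed from each sample's vision/language features
        self.seeds = nn.Parameter(torch.randn(num_queries, dim))
        self.attn_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, HW, C) flattened visual features; lang_feat: (B, L, C) word features
        seeds = self.seeds.unsqueeze(0).expand(vis_feat.size(0), -1, -1)   # (B, Nq, C)
        q_lang, _ = self.attn_lang(seeds, lang_feat, lang_feat)            # attend to words
        q_vis, _ = self.attn_vis(q_lang, vis_feat, vis_feat)               # attend to image
        return self.proj(q_lang + q_vis)                                   # (B, Nq, C) queries


class QueryBalanceModule(nn.Module):
    """Hypothetical sketch: score each query and fuse the per-query decoder responses."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, queries: torch.Tensor, responses: torch.Tensor) -> torch.Tensor:
        # queries, responses: (B, Nq, C); responses = decoder output for each query
        weights = torch.softmax(self.score(queries), dim=1)                # (B, Nq, 1)
        return (weights * responses).sum(dim=1)                            # (B, C) fused feature


if __name__ == "__main__":
    B, HW, L, C, Nq = 2, 400, 12, 256, 16
    qgm, qbm = QueryGenerationModule(C, Nq), QueryBalanceModule(C)
    queries = qgm(torch.randn(B, HW, C), torch.randn(B, L, C))             # (B, Nq, C)
    fused = qbm(queries, torch.randn(B, Nq, C))                            # (B, C)
    print(fused.shape)
```

In the paper's setting the per-query responses would come from the transformer decoder and feed mask prediction; the shapes above are kept abstract on purpose.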

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Video | MeViS | F | 37.3 | VLT+TC
Video | MeViS | J | 33.6 | VLT+TC
Video | MeViS | J&F | 35.5 | VLT+TC
Video | Refer-YouTube-VOS | F | 65.6 | VLT
Video | Refer-YouTube-VOS | J | 61.9 | VLT
Video | Refer-YouTube-VOS | J&F | 63.8 | VLT
Instance Segmentation | RefCOCO val | Overall IoU | 72.96 | VLT
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 65.6 | VLT
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 61.9 | VLT
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 63.8 | VLT
Instance Segmentation | RefCOCO+ val | Overall IoU | 63.53 | VLT
Instance Segmentation | RefCOCO+ testB | Overall IoU | 56.92 | VLT
Instance Segmentation | RefCOCO+ testA | Overall IoU | 68.43 | VLT
Instance Segmentation | RefCOCOg val | Overall IoU | 63.49 | VLT (Swin-B)
Video Object Segmentation | MeViS | F | 37.3 | VLT+TC
Video Object Segmentation | MeViS | J | 33.6 | VLT+TC
Video Object Segmentation | MeViS | J&F | 35.5 | VLT+TC
Video Object Segmentation | Refer-YouTube-VOS | F | 65.6 | VLT
Video Object Segmentation | Refer-YouTube-VOS | J | 61.9 | VLT
Video Object Segmentation | Refer-YouTube-VOS | J&F | 63.8 | VLT
Referring Expression Segmentation | RefCOCO val | Overall IoU | 72.96 | VLT
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 65.6 | VLT
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 61.9 | VLT
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 63.8 | VLT
Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 63.53 | VLT
Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 56.92 | VLT
Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 68.43 | VLT
Referring Expression Segmentation | RefCOCOg val | Overall IoU | 63.49 | VLT (Swin-B)
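
For reading the numbers above: on the video benchmarks (Refer-YouTube-VOS, MeViS) J is the region similarity (mask IoU), F the contour accuracy, and J&F their mean, while on the RefCOCO-family image benchmarks Overall IoU accumulates intersection and union over the whole split rather than averaging per-image IoU. The sketch below reflects my reading of these standard protocols, not code from the paper.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks (region similarity J for a single sample)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0

def overall_iou(preds, gts) -> float:
    """RefCOCO-style Overall IoU: sum intersections and unions over the whole split."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(inter / union)

def j_and_f(j_scores, f_scores) -> float:
    """J&F is the average of the mean region score (J) and the mean contour score (F)."""
    return (float(np.mean(j_scores)) + float(np.mean(f_scores))) / 2
```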

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation (2025-07-13)
MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation (2025-07-10)
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy (2025-07-02)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (2025-06-15)
THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation (2025-06-07)
VideoMolmo: Spatio-Temporal Grounding Meets Pointing (2025-06-05)