Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vision-Language Transformer and Query Generation for Referring Segmentation

Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang

2021-08-12 | ICCV 2021 10 | Tasks: Generalized Referring Expression Comprehension, Generalized Referring Expression Segmentation, Referring Expression Segmentation, Segmentation

Abstract

In this work, we address the challenging task of referring segmentation. The query expression in referring segmentation typically indicates the target object by describing its relationship with others. Therefore, to find the target one among all instances in the image, the model must have a holistic understanding of the whole image. To achieve this, we reformulate referring segmentation as a direct attention problem: finding the region in the image where the query language expression is most attended to. We introduce transformer and multi-head attention to build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression. Furthermore, we propose a Query Generation Module, which produces multiple sets of queries with different attention weights that represent the diversified comprehensions of the language expression from different aspects. At the same time, to find the best way from these diversified comprehensions based on visual clues, we further propose a Query Balance Module to adaptively select the output features of these queries for a better mask generation. Without bells and whistles, our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets, RefCOCO, RefCOCO+, and G-Ref. Our code is available at https://github.com/henghuiding/Vision-Language-Transformer.
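The abstract's three ideas (language-derived queries that attend over the image, several differently weighted queries, and an adaptive combination of their outputs) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `emphasis_vectors` that stand in for learned query-generation weights, and the confidence rule in `balanced_attention` (the peak value of each attention map) are all assumptions made for illustration. Features are plain Python lists so the example runs without extra packages.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def generate_queries(word_feats, emphasis_vectors):
    """Build one query per emphasis vector (hypothetical stand-in for
    learned weights): each query is a differently weighted sum of words."""
    queries = []
    for e in emphasis_vectors:
        word_weights = softmax([dot(e, w) for w in word_feats])
        queries.append(weighted_sum(word_weights, word_feats))
    return queries

def attend(query, pixel_feats):
    """Scaled dot-product attention of one query over all pixel features."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, p) / scale for p in pixel_feats])

def balanced_attention(queries, pixel_feats):
    """Combine per-query attention maps with confidence weights.
    Assumption: confidence is the softmax of each map's peak value."""
    maps = [attend(q, pixel_feats) for q in queries]
    conf = softmax([max(m) for m in maps])
    n = len(pixel_feats)
    return [sum(c * m[i] for c, m in zip(conf, maps)) for i in range(n)]

# Toy inputs: two word features, two queries, three pixel features.
word_feats = [[1.0, 0.0], [0.0, 1.0]]
emphasis_vectors = [[2.0, 0.0], [0.0, 2.0]]
pixel_feats = [[3.0, 0.0], [0.0, 3.0], [0.1, 0.1]]

queries = generate_queries(word_feats, emphasis_vectors)
attention = balanced_attention(queries, pixel_feats)
```

Here `attention` is a distribution over the three pixels; the third pixel, which matches neither word strongly, receives the least weight. A real model would use learned projections and multi-head attention, then decode a mask from the fused features.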

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | RefCoCo val | Overall IoU | 65.65 | VLT |
| Instance Segmentation | RefCOCOg-test | Overall IoU | 56.65 | VLT (Darknet53) |
| Instance Segmentation | RefCOCO+ val | Overall IoU | 55.5 | VLT |
| Instance Segmentation | RefCOCO+ test B | Overall IoU | 49.36 | VLT |
| Instance Segmentation | RefCOCO+ testA | Overall IoU | 59.2 | VLT |
| Instance Segmentation | RefCOCOg-val | Overall IoU | 52.99 | VLT (Darknet53) |
| Instance Segmentation | gRefCOCO | cIoU | 52.51 | VLT |
| Instance Segmentation | gRefCOCO | gIoU | 52 | VLT |
| Referring Expression Segmentation | RefCoCo val | Overall IoU | 65.65 | VLT |
| Referring Expression Segmentation | RefCOCOg-test | Overall IoU | 56.65 | VLT (Darknet53) |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 55.5 | VLT |
| Referring Expression Segmentation | RefCOCO+ test B | Overall IoU | 49.36 | VLT |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 59.2 | VLT |
| Referring Expression Segmentation | RefCOCOg-val | Overall IoU | 52.99 | VLT (Darknet53) |
| Referring Expression Segmentation | gRefCOCO | cIoU | 52.51 | VLT |
| Referring Expression Segmentation | gRefCOCO | gIoU | 52 | VLT |
| Generalized Referring Expression Comprehension | gRefCOCO | N-acc. | 35.2 | VLT |
| Generalized Referring Expression Comprehension | gRefCOCO | Precision@(F1=1, IoU≥0.5) | 36.6 | VLT |
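The results above mix two styles of IoU metric. As commonly defined in referring segmentation work, overall (cumulative) IoU divides the total intersection by the total union over all samples, while gIoU averages the per-sample IoU. The sketch below illustrates that difference under those assumed definitions; the exact evaluation protocol behind each entry is not stated on this page. The function names are hypothetical, masks are flat lists of 0/1 values, and scoring a sample with an empty prediction and an empty target as 1.0 is an assumption.

```python
def intersection_and_union(pred, target):
    """Count intersecting and united pixels of two flat 0/1 masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union

def overall_iou(preds, targets):
    """Cumulative intersection over cumulative union across all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0

def mean_iou(preds, targets):
    """Average of per-sample IoU values."""
    ious = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: empty prediction on an empty target counts as 1.0.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)

# Two toy samples with four pixels each.
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
targets = [[1, 0, 0, 0], [1, 1, 1, 1]]

cumulative = overall_iou(preds, targets)  # (1 + 1) / (2 + 4)
averaged = mean_iou(preds, targets)       # (0.5 + 0.25) / 2
```

Because the cumulative form pools pixels before dividing, samples with large objects weigh more heavily in it than in the per-sample average, so the two numbers are not directly comparable.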

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)