Vision-Language Transformer and Query Generation for Referring Segmentation

Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang

Published: 2021-08-12
Venue: ICCV 2021
Tasks: Generalized Referring Expression Comprehension, Generalized Referring Expression Segmentation, Referring Expression Segmentation, Segmentation


Abstract

In this work, we address the challenging task of referring segmentation. The query expression in referring segmentation typically indicates the target object by describing its relationship with others. Therefore, to find the target one among all instances in the image, the model must have a holistic understanding of the whole image. To achieve this, we reformulate referring segmentation as a direct attention problem: finding the region in the image where the query language expression is most attended to. We introduce transformer and multi-head attention to build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression. Furthermore, we propose a Query Generation Module, which produces multiple sets of queries with different attention weights that represent the diversified comprehensions of the language expression from different aspects. At the same time, to find the best way from these diversified comprehensions based on visual clues, we further propose a Query Balance Module to adaptively select the output features of these queries for a better mask generation. Without bells and whistles, our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets, RefCOCO, RefCOCO+, and G-Ref. Our code is available at https://github.com/henghuiding/Vision-Language-Transformer.
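The idea of "querying" the image with the language expression can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain single-head scaled dot-product attention instead of the multi-head transformer described in the abstract, and the names `query_image` and `balance_queries`, the feature shapes, and the softmax weighting used to combine query outputs are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_image(queries, image_feats):
    """Single-head cross-attention (illustrative only).

    queries:     (num_queries, channels) vectors derived from the expression
    image_feats: (height * width, channels) flattened visual features
    Returns the attended features (num_queries, channels) and the
    attention weights (num_queries, height * width).
    """
    channels = queries.shape[-1]
    scores = queries @ image_feats.T / np.sqrt(channels)
    weights = softmax(scores, axis=-1)  # each query distributes attention over pixels
    return weights @ image_feats, weights

def balance_queries(outputs, confidences):
    """Hypothetical stand-in for query balancing: weight each query's
    output by a softmax-normalized confidence and sum the results."""
    w = softmax(confidences, axis=0)
    return (w[:, None] * outputs).sum(axis=0)

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))        # 4 assumed query vectors, 8 channels
image_feats = rng.normal(size=(6, 8))    # 6 assumed pixel locations, 8 channels
attended, weights = query_image(queries, image_feats)
fused = balance_queries(attended, rng.normal(size=4))
```

Here each row of `weights` shows where one query attends in the image, which is the sense in which the abstract treats segmentation as finding the region the expression attends to most.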

Results

Task	Dataset	Metric	Value	Model
Instance Segmentation	RefCOCO val	Overall IoU	65.65	VLT
Instance Segmentation	RefCOCOg-test	Overall IoU	56.65	VLT (Darknet53)
Instance Segmentation	RefCOCO+ val	Overall IoU	55.5	VLT
Instance Segmentation	RefCOCO+ testB	Overall IoU	49.36	VLT
Instance Segmentation	RefCOCO+ testA	Overall IoU	59.2	VLT
Instance Segmentation	RefCOCOg-val	Overall IoU	52.99	VLT (Darknet53)
Instance Segmentation	gRefCOCO	cIoU	52.51	VLT
Instance Segmentation	gRefCOCO	gIoU	52	VLT
Referring Expression Segmentation	RefCOCO val	Overall IoU	65.65	VLT
Referring Expression Segmentation	RefCOCOg-test	Overall IoU	56.65	VLT (Darknet53)
Referring Expression Segmentation	RefCOCO+ val	Overall IoU	55.5	VLT
Referring Expression Segmentation	RefCOCO+ testB	Overall IoU	49.36	VLT
Referring Expression Segmentation	RefCOCO+ testA	Overall IoU	59.2	VLT
Referring Expression Segmentation	RefCOCOg-val	Overall IoU	52.99	VLT (Darknet53)
Referring Expression Segmentation	gRefCOCO	cIoU	52.51	VLT
Referring Expression Segmentation	gRefCOCO	gIoU	52	VLT
Generalized Referring Expression Comprehension	gRefCOCO	N-acc.	35.2	VLT
Generalized Referring Expression Comprehension	gRefCOCO	Precision@(F1=1, IoU≥0.5)	36.6	VLT
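For readers unfamiliar with the segmentation metrics in the table, the sketch below shows how they are commonly computed in the referring segmentation literature: overall IoU (also reported as cIoU) accumulates intersection and union over all samples before dividing, while a per-sample mean averages the IoU of each sample. The function names are hypothetical, and scoring an empty prediction against an empty target as 1.0 is an assumption; the benchmark's official evaluation code may handle such cases differently.

```python
import numpy as np

def overall_iou(preds, gts):
    """Cumulative intersection over cumulative union across all samples."""
    inter = sum(int(np.logical_and(p, g).sum()) for p, g in zip(preds, gts))
    union = sum(int(np.logical_or(p, g).sum()) for p, g in zip(preds, gts))
    return inter / union if union > 0 else 1.0

def mean_iou(preds, gts):
    """Average of per-sample IoU values."""
    scores = []
    for p, g in zip(preds, gts):
        union = int(np.logical_or(p, g).sum())
        inter = int(np.logical_and(p, g).sum())
        # Assumption: empty prediction and empty target count as a perfect match.
        scores.append(inter / union if union > 0 else 1.0)
    return float(np.mean(scores))

# Two toy 2x2 binary masks: one half-correct, one exact.
preds = [np.array([[1, 1], [0, 0]], dtype=bool), np.ones((2, 2), dtype=bool)]
gts = [np.array([[1, 0], [0, 0]], dtype=bool), np.ones((2, 2), dtype=bool)]
o_iou = overall_iou(preds, gts)   # (1 + 4) / (2 + 4)
m_iou = mean_iou(preds, gts)      # (0.5 + 1.0) / 2
```

Because the cumulative form weights every pixel equally, large objects dominate overall IoU, whereas the per-sample mean gives each expression the same weight regardless of object size.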
