VLT: Vision-Language Transformer and Query Generation for Referring Segmentation

Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang

2022-10-28 · Referring Video Object Segmentation · Referring Expression Segmentation · Video Object Segmentation

Paper · PDF · Code (official)

Abstract

We propose a Vision-Language Transformer (VLT) framework for referring segmentation that facilitates deep interaction among multi-modal information and enhances the holistic understanding of vision-language features. The dynamic emphasis of a language expression can be understood in different ways, especially when it interacts with the image. However, the learned queries in existing transformer works are fixed after training and therefore cannot cope with the randomness and huge diversity of language expressions. To address this issue, we propose a Query Generation Module, which dynamically produces multiple sets of input-specific queries to represent the diverse comprehensions of the language expression. To find the best among these diverse comprehensions, and thus generate a better mask, we propose a Query Balance Module to selectively fuse the corresponding responses of the set of queries. Furthermore, to strengthen the model's ability to deal with diverse language expressions, we consider inter-sample learning to explicitly endow the model with knowledge of different language expressions that refer to the same object. We introduce masked contrastive learning to pull together the features of different expressions for the same target object while distinguishing the features of different objects. The proposed approach is lightweight and consistently achieves new state-of-the-art referring segmentation results on five datasets.
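As a rough illustration of the two modules described in the abstract, the sketch below (PyTorch; not the authors' implementation) shows one hypothetical way a query-generation step could derive input-specific queries from vision and language features, and how a query-balance step could weight and fuse the per-query responses. All class names, tensor shapes, and layer choices here are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class QueryGenerationModule(nn.Module):
    """Hypothetical sketch: derive input-specific queries from vision + language features."""
    def __init__(self, dim: int = 256, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        # learned seeds; the emitted queries still vary per input because they are
        # re-computed from each sample's vision/language features
        self.seeds = nn.Parameter(torch.randn(num_queries, dim))
        self.attn_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, HW, C) flattened visual features; lang_feat: (B, L, C) word features
        seeds = self.seeds.unsqueeze(0).expand(vis_feat.size(0), -1, -1)   # (B, Nq, C)
        q_lang, _ = self.attn_lang(seeds, lang_feat, lang_feat)            # attend to words
        q_vis, _ = self.attn_vis(q_lang, vis_feat, vis_feat)               # attend to image
        return self.proj(q_lang + q_vis)                                   # (B, Nq, C) queries


class QueryBalanceModule(nn.Module):
    """Hypothetical sketch: score each query and fuse the per-query decoder responses."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, queries: torch.Tensor, responses: torch.Tensor) -> torch.Tensor:
        # queries, responses: (B, Nq, C); responses = decoder output for each query
        weights = torch.softmax(self.score(queries), dim=1)                # (B, Nq, 1)
        return (weights * responses).sum(dim=1)                            # (B, C) fused feature


if __name__ == "__main__":
    B, HW, L, C, Nq = 2, 400, 12, 256, 16
    qgm, qbm = QueryGenerationModule(C, Nq), QueryBalanceModule(C)
    queries = qgm(torch.randn(B, HW, C), torch.randn(B, L, C))             # (B, Nq, C)
    fused = qbm(queries, torch.randn(B, Nq, C))                            # (B, C)
    print(fused.shape)
```

In the paper's setting the per-query responses would come from the transformer decoder and feed mask prediction; the shapes above are kept abstract on purpose.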

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Video | MeViS | F | 37.3 | VLT+TC
Video | MeViS | J | 33.6 | VLT+TC
Video | MeViS | J&F | 35.5 | VLT+TC
Video | Refer-YouTube-VOS | F | 65.6 | VLT
Video | Refer-YouTube-VOS | J | 61.9 | VLT
Video | Refer-YouTube-VOS | J&F | 63.8 | VLT
Instance Segmentation | RefCOCO val | Overall IoU | 72.96 | VLT
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 65.6 | VLT
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 61.9 | VLT
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 63.8 | VLT
Instance Segmentation | RefCOCO+ val | Overall IoU | 63.53 | VLT
Instance Segmentation | RefCOCO+ testB | Overall IoU | 56.92 | VLT
Instance Segmentation | RefCOCO+ testA | Overall IoU | 68.43 | VLT
Instance Segmentation | RefCOCOg val | Overall IoU | 63.49 | VLT (Swin-B)
Video Object Segmentation | MeViS | F | 37.3 | VLT+TC
Video Object Segmentation | MeViS | J | 33.6 | VLT+TC
Video Object Segmentation | MeViS | J&F | 35.5 | VLT+TC
Video Object Segmentation | Refer-YouTube-VOS | F | 65.6 | VLT
Video Object Segmentation | Refer-YouTube-VOS | J | 61.9 | VLT
Video Object Segmentation | Refer-YouTube-VOS | J&F | 63.8 | VLT
Referring Expression Segmentation | RefCOCO val | Overall IoU | 72.96 | VLT
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 65.6 | VLT
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 61.9 | VLT
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 63.8 | VLT
Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 63.53 | VLT
Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 56.92 | VLT
Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 68.43 | VLT
Referring Expression Segmentation | RefCOCOg val | Overall IoU | 63.49 | VLT (Swin-B)
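
For reading the numbers above: on the video benchmarks (Refer-YouTube-VOS, MeViS) J is the region similarity (mask IoU), F the contour accuracy, and J&F their mean, while on the RefCOCO-family image benchmarks Overall IoU accumulates intersection and union over the whole split rather than averaging per-image IoU. The sketch below reflects my reading of these standard protocols, not code from the paper.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks (region similarity J for a single sample)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0

def overall_iou(preds, gts) -> float:
    """RefCOCO-style Overall IoU: sum intersections and unions over the whole split."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(inter / union)

def j_and_f(j_scores, f_scores) -> float:
    """J&F is the average of the mean region score (J) and the mean contour score (F)."""
    return (float(np.mean(j_scores)) + float(np.mean(f_scores))) / 2
```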

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation (2025-07-13)
MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation (2025-07-10)
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy (2025-07-02)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (2025-06-15)
THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation (2025-06-07)
VideoMolmo: Spatio-Temporal Grounding Meets Pointing (2025-06-05)