Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Local-Global Context Aware Transformer for Language-Guided Video Segmentation

Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo, Yi Yang

2022-03-18

Tasks: Visual Grounding · Referring Video Object Segmentation · Referring Expression Segmentation · Segmentation · Semantic Segmentation · Video Segmentation · Video Object Segmentation · Video Semantic Segmentation

Links: Paper · PDF · Code (official)

Abstract

We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representation, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components -- one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant-size memory, whereas Transformer-style self-attention scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that Locater outperforms previous state-of-the-art methods. Further, we won 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation for the winning solution. Our code and dataset are available at: https://github.com/leonnnop/Locater
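The abstract's central idea — a fixed-size memory (one global, one local component) from which an adaptive per-frame query is built, keeping per-frame cost constant — can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: all class and variable names (`FiniteMemorySegmenter`, the running-mean global memory, the 0.5 mixing weights) are assumptions chosen for clarity, and real frame/language features would come from learned encoders.

```python
import numpy as np
from collections import deque

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class FiniteMemorySegmenter:
    """Toy sketch of the finite-memory idea: a fixed-size global memory
    (running summary of the whole video) plus a small local memory
    (recent frames), so per-frame cost stays constant regardless of
    video length, unlike full self-attention over all frames."""

    def __init__(self, dim, local_size=3):
        self.global_mem = np.zeros(dim)            # persistent global context
        self.local_mem = deque(maxlen=local_size)  # recent temporal context
        self.n_frames = 0

    def step(self, frame_feats, lang_emb):
        # frame_feats: (num_pixels, dim) features of the current frame
        # lang_emb:    (dim,) embedding of the language expression
        frame_summary = frame_feats.mean(axis=0)
        # Update global memory as a running mean over all frames seen so far.
        self.n_frames += 1
        self.global_mem += (frame_summary - self.global_mem) / self.n_frames
        # Local context: mean of the last few frame summaries (if any).
        local_ctx = (np.mean(self.local_mem, axis=0)
                     if self.local_mem else np.zeros_like(lang_emb))
        # Adaptive query: the expression conditioned on memorized context
        # (mixing weights here are arbitrary illustrative constants).
        query = lang_emb + 0.5 * self.global_mem + 0.5 * local_ctx
        # Cross-attend the query to the frame's pixels for a soft mask.
        scores = frame_feats @ query / np.sqrt(len(query))
        mask = softmax(scores)
        self.local_mem.append(frame_summary)
        return mask
```

Because each step touches only the current frame and two fixed-size memory slots, processing T frames is O(T) in time with O(1) memory, matching the complexity claim in the abstract.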

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 51.1 | Locater |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 48.8 | Locater |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 50 | Locater |
| Instance Segmentation | A2D Sentences | AP | 0.465 | Locater |
| Instance Segmentation | A2D Sentences | IoU mean | 0.597 | Locater |
| Instance Segmentation | A2D Sentences | IoU overall | 0.69 | Locater |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.709 | Locater |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.64 | Locater |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.525 | Locater |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.351 | Locater |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.101 | Locater |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 51.1 | Locater |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 48.8 | Locater |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 50 | Locater |
| Referring Expression Segmentation | A2D Sentences | AP | 0.465 | Locater |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.597 | Locater |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.69 | Locater |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.709 | Locater |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.64 | Locater |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.525 | Locater |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.351 | Locater |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.101 | Locater |
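The rounded J&F entry above is consistent with the other two Refer-YouTube-VOS rows, assuming the standard video object segmentation convention that J&F is the arithmetic mean of region similarity (J) and contour accuracy (F):

```python
# Consistency check for the table, assuming J&F = mean(J, F)
# as in standard VOS benchmark evaluation.
j_region = 48.8   # J on Refer-YouTube-VOS (2021 public validation)
f_contour = 51.1  # F on the same split
j_and_f = (j_region + f_contour) / 2
print(j_and_f)    # averages to 49.95; the table reports it rounded to 50
```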

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
- Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
- A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)