
Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation

Sunghwan Hong, Seokju Cho, Jisu Nam, Stephen Lin, Seungryong Kim

2022-07-22 · Semantic correspondence · Few-Shot Semantic Segmentation
Paper · PDF · Code (official)

Abstract

This paper presents a novel cost aggregation network, called Volumetric Aggregation with Transformers (VAT), for few-shot segmentation. The use of transformers can benefit correlation map aggregation through self-attention over a global receptive field. However, the tokenization of a correlation map for transformer processing can be detrimental, because the discontinuity at token boundaries reduces the local context available near the token edges and decreases inductive bias. To address this problem, we propose a 4D Convolutional Swin Transformer, where a high-dimensional Swin Transformer is preceded by a series of small-kernel convolutions that impart local context to all pixels and introduce convolutional inductive bias. We additionally boost aggregation performance by applying transformers within a pyramidal structure, where aggregation at a coarser level guides aggregation at a finer level. Noise in the transformer output is then filtered in the subsequent decoder with the help of the query's appearance embedding. With this model, a new state-of-the-art is set for all the standard benchmarks in few-shot segmentation. It is shown that VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
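The abstract describes three key components: a correlation (cost) volume between query and support features, small-kernel convolutions applied to that 4D volume to inject local context and convolutional inductive bias, and window-based self-attention for global aggregation. The following is a minimal, illustrative PyTorch sketch of that pipeline, not the authors' implementation (available in the official code linked above); the separable-convolution approximation of the 4D convolution, the window size, and all layer dimensions are assumptions made purely for illustration.

```python
# Illustrative sketch only: correlation volume -> small-kernel (separable) 4D conv
# -> windowed self-attention, loosely following the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F


def correlation_volume(query_feat, support_feat):
    """Cosine-similarity correlations: [B,C,Hq,Wq] x [B,C,Hs,Ws] -> [B,Hq,Wq,Hs,Ws]."""
    q = F.normalize(query_feat.flatten(2), dim=1)    # [B, C, Hq*Wq]
    s = F.normalize(support_feat.flatten(2), dim=1)  # [B, C, Hs*Ws]
    corr = torch.einsum("bcq,bcs->bqs", q, s)
    B, _, Hq, Wq = query_feat.shape
    _, _, Hs, Ws = support_feat.shape
    return corr.view(B, Hq, Wq, Hs, Ws).clamp(min=0)


class Separable4DConv(nn.Module):
    """Small-kernel '4D' convolution approximated by two 2D convolutions:
    one over the query axes (Hq, Wq) and one over the support axes (Hs, Ws)."""
    def __init__(self, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_q = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.conv_s = nn.Conv2d(1, 1, kernel_size, padding=pad)

    def forward(self, corr):  # [B, Hq, Wq, Hs, Ws]
        B, Hq, Wq, Hs, Ws = corr.shape
        x = corr.permute(0, 3, 4, 1, 2).reshape(B * Hs * Ws, 1, Hq, Wq)
        x = self.conv_q(x).reshape(B, Hs, Ws, Hq, Wq)
        x = x.permute(0, 3, 4, 1, 2).reshape(B * Hq * Wq, 1, Hs, Ws)
        return self.conv_s(x).reshape(B, Hq, Wq, Hs, Ws)


class Windowed4DAttentionBlock(nn.Module):
    """Local context via the separable 4D conv, then self-attention inside
    non-overlapping 4D windows of the correlation volume (no window shifting)."""
    def __init__(self, window=4, embed_dim=64, num_heads=4):
        super().__init__()
        self.w = window
        self.local = Separable4DConv()
        self.proj_in = nn.Linear(1, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.proj_out = nn.Linear(embed_dim, 1)

    def forward(self, corr):  # [B, Hq, Wq, Hs, Ws]
        B, Hq, Wq, Hs, Ws = corr.shape
        w = self.w
        x = self.local(corr)  # convolutional inductive bias before tokenization
        # Partition the 4D volume into non-overlapping windows of w**4 tokens.
        x = x.view(B, Hq // w, w, Wq // w, w, Hs // w, w, Ws // w, w)
        x = x.permute(0, 1, 3, 5, 7, 2, 4, 6, 8).reshape(-1, w ** 4, 1)
        t = self.proj_in(x)
        t, _ = self.attn(t, t, t, need_weights=False)
        x = self.proj_out(t)
        # Undo the window partition.
        x = x.view(B, Hq // w, Wq // w, Hs // w, Ws // w, w, w, w, w)
        x = x.permute(0, 1, 5, 2, 6, 3, 7, 4, 8).reshape(B, Hq, Wq, Hs, Ws)
        return x + corr  # residual connection


if __name__ == "__main__":
    q = torch.randn(1, 256, 8, 8)  # toy query features
    s = torch.randn(1, 256, 8, 8)  # toy support features
    corr = correlation_volume(q, s)
    print(Windowed4DAttentionBlock()(corr).shape)  # torch.Size([1, 8, 8, 8, 8])
```

This sketch omits two pieces the abstract mentions: the pyramidal structure, in which aggregation at a coarser level guides aggregation at a finer level, and the decoder that filters the aggregated output with the help of the query's appearance embedding.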

Results

Task | Dataset | Metric | Value | Model
Few-Shot Learning | FSS-1000 (5-shot) | FB-IoU | 94.4 | VAT (ResNet-101)
Few-Shot Learning | FSS-1000 (5-shot) | Mean IoU | 90.8 | VAT (ResNet-101)
Few-Shot Learning | FSS-1000 (5-shot) | FB-IoU | 94.2 | VAT (ResNet-50)
Few-Shot Learning | FSS-1000 (5-shot) | Mean IoU | 90.7 | VAT (ResNet-50)
Few-Shot Learning | COCO-20i (5-shot) | FB-IoU | 72.4 | VAT (ResNet-101)
Few-Shot Learning | COCO-20i (5-shot) | Mean IoU | 47.9 | VAT (ResNet-101)
Few-Shot Learning | FSS-1000 (1-shot) | FB-IoU | 94.0 | VAT (ResNet-101)
Few-Shot Learning | FSS-1000 (1-shot) | Mean IoU | 90.3 | VAT (ResNet-101)
Few-Shot Learning | FSS-1000 (1-shot) | FB-IoU | 93.8 | VAT (ResNet-50)
Few-Shot Learning | FSS-1000 (1-shot) | Mean IoU | 90.1 | VAT (ResNet-50)
Few-Shot Learning | PASCAL-5i (1-shot) | FB-IoU | 79.6 | VAT (ResNet-101)
Few-Shot Learning | PASCAL-5i (1-shot) | Mean IoU | 67.9 | VAT (ResNet-101)
Few-Shot Learning | PASCAL-5i (1-shot) | FB-IoU | 77.8 | VAT (ResNet-50)
Few-Shot Learning | PASCAL-5i (1-shot) | Mean IoU | 65.5 | VAT (ResNet-50)
Few-Shot Learning | COCO-20i (1-shot) | FB-IoU | 68.8 | VAT (ResNet-101)
Few-Shot Learning | COCO-20i (1-shot) | Mean IoU | 41.3 | VAT (ResNet-101)
Few-Shot Learning | PASCAL-5i (5-shot) | FB-IoU | 83.2 | VAT (ResNet-101)
Few-Shot Learning | PASCAL-5i (5-shot) | Mean IoU | 72.0 | VAT (ResNet-101)
Few-Shot Learning | PASCAL-5i (5-shot) | FB-IoU | 80.9 | VAT (ResNet-50)
Few-Shot Learning | PASCAL-5i (5-shot) | Mean IoU | 70.1 | VAT (ResNet-50)
Image Matching | SPair-71k | PCK | 55.5 | VAT (ECCV)
Image Matching | PF-PASCAL | PCK | 92.3 | VAT (ECCV)
Image Matching | PF-WILLOW | PCK | 81.6 | VAT (ECCV)
Few-Shot Semantic Segmentation | FSS-1000 (5-shot) | FB-IoU | 94.4 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | FSS-1000 (5-shot) | Mean IoU | 90.8 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | FSS-1000 (5-shot) | FB-IoU | 94.2 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | FSS-1000 (5-shot) | Mean IoU | 90.7 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | COCO-20i (5-shot) | FB-IoU | 72.4 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | COCO-20i (5-shot) | Mean IoU | 47.9 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | FSS-1000 (1-shot) | FB-IoU | 94.0 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | FSS-1000 (1-shot) | Mean IoU | 90.3 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | FSS-1000 (1-shot) | FB-IoU | 93.8 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | FSS-1000 (1-shot) | Mean IoU | 90.1 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | PASCAL-5i (1-shot) | FB-IoU | 79.6 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | PASCAL-5i (1-shot) | Mean IoU | 67.9 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | PASCAL-5i (1-shot) | FB-IoU | 77.8 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | PASCAL-5i (1-shot) | Mean IoU | 65.5 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | COCO-20i (1-shot) | FB-IoU | 68.8 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | COCO-20i (1-shot) | Mean IoU | 41.3 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | PASCAL-5i (5-shot) | FB-IoU | 83.2 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | PASCAL-5i (5-shot) | Mean IoU | 72.0 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | PASCAL-5i (5-shot) | FB-IoU | 80.9 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | PASCAL-5i (5-shot) | Mean IoU | 70.1 | VAT (ResNet-50)
Meta-Learning | FSS-1000 (5-shot) | FB-IoU | 94.4 | VAT (ResNet-101)
Meta-Learning | FSS-1000 (5-shot) | Mean IoU | 90.8 | VAT (ResNet-101)
Meta-Learning | FSS-1000 (5-shot) | FB-IoU | 94.2 | VAT (ResNet-50)
Meta-Learning | FSS-1000 (5-shot) | Mean IoU | 90.7 | VAT (ResNet-50)
Meta-Learning | COCO-20i (5-shot) | FB-IoU | 72.4 | VAT (ResNet-101)
Meta-Learning | COCO-20i (5-shot) | Mean IoU | 47.9 | VAT (ResNet-101)
Meta-Learning | FSS-1000 (1-shot) | FB-IoU | 94.0 | VAT (ResNet-101)
Meta-Learning | FSS-1000 (1-shot) | Mean IoU | 90.3 | VAT (ResNet-101)
Meta-Learning | FSS-1000 (1-shot) | FB-IoU | 93.8 | VAT (ResNet-50)
Meta-Learning | FSS-1000 (1-shot) | Mean IoU | 90.1 | VAT (ResNet-50)
Meta-Learning | PASCAL-5i (1-shot) | FB-IoU | 79.6 | VAT (ResNet-101)
Meta-Learning | PASCAL-5i (1-shot) | Mean IoU | 67.9 | VAT (ResNet-101)
Meta-Learning | PASCAL-5i (1-shot) | FB-IoU | 77.8 | VAT (ResNet-50)
Meta-Learning | PASCAL-5i (1-shot) | Mean IoU | 65.5 | VAT (ResNet-50)
Meta-Learning | COCO-20i (1-shot) | FB-IoU | 68.8 | VAT (ResNet-101)
Meta-Learning | COCO-20i (1-shot) | Mean IoU | 41.3 | VAT (ResNet-101)
Meta-Learning | PASCAL-5i (5-shot) | FB-IoU | 83.2 | VAT (ResNet-101)
Meta-Learning | PASCAL-5i (5-shot) | Mean IoU | 72.0 | VAT (ResNet-101)
Meta-Learning | PASCAL-5i (5-shot) | FB-IoU | 80.9 | VAT (ResNet-50)
Meta-Learning | PASCAL-5i (5-shot) | Mean IoU | 70.1 | VAT (ResNet-50)
Semantic correspondence | SPair-71k | PCK | 55.5 | VAT (ECCV)
Semantic correspondence | PF-PASCAL | PCK | 92.3 | VAT (ECCV)
Semantic correspondence | PF-WILLOW | PCK | 81.6 | VAT (ECCV)
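
The table reports FB-IoU and mean IoU for the segmentation benchmarks and PCK for the correspondence benchmarks. Below is a minimal sketch of these metrics under their commonly used definitions; the exact evaluation protocol (e.g. the PCK reference size, which varies by benchmark, or how mean IoU is averaged over classes and folds) is not given on this page.

```python
# Sketch of the metrics above, using their common definitions (not the paper's
# evaluation code).
import numpy as np


def iou(pred, gt):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0


def fb_iou(pred, gt):
    """Foreground-background IoU: average of foreground IoU and background IoU.
    Mean IoU, by contrast, averages per-class foreground IoU over test classes."""
    return 0.5 * (iou(pred, gt) + iou(~pred, ~gt))


def pck(pred_kps, gt_kps, ref_size, alpha=0.1):
    """Percentage of Correct Keypoints: a predicted keypoint is correct if it
    lies within alpha * ref_size of the ground truth, where ref_size is the
    benchmark's reference length (image or bounding-box size)."""
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dists <= alpha * ref_size).mean())
```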

Related Papers

RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control (2025-06-15)
Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence (2025-06-09)
Adapter Naturally Serves as Decoupler for Cross-Domain Few-Shot Semantic Segmentation (2025-06-09)
Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels (2025-06-05)
MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation (2025-06-03)
Cora: Correspondence-aware image editing using few step diffusion (2025-05-29)
Semantic Correspondence: Unified Benchmarking and a Strong Baseline (2025-05-23)
DINOv2-powered Few-Shot Semantic Segmentation: A Unified Framework via Cross-Model Distillation and 4D Correlation Mining (2025-04-22)