
Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation

Sunghwan Hong, Seokju Cho, Jisu Nam, Stephen Lin, Seungryong Kim

2022-07-22 · Semantic correspondence · Few-Shot Semantic Segmentation
Paper · PDF · Code (official)

Abstract

This paper presents a novel cost aggregation network, called Volumetric Aggregation with Transformers (VAT), for few-shot segmentation. The use of transformers can benefit correlation map aggregation through self-attention over a global receptive field. However, the tokenization of a correlation map for transformer processing can be detrimental, because the discontinuity at token boundaries reduces the local context available near the token edges and decreases inductive bias. To address this problem, we propose a 4D Convolutional Swin Transformer, where a high-dimensional Swin Transformer is preceded by a series of small-kernel convolutions that impart local context to all pixels and introduce convolutional inductive bias. We additionally boost aggregation performance by applying transformers within a pyramidal structure, where aggregation at a coarser level guides aggregation at a finer level. Noise in the transformer output is then filtered in the subsequent decoder with the help of the query's appearance embedding. With this model, a new state-of-the-art is set for all the standard benchmarks in few-shot segmentation. It is shown that VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
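The abstract describes three key components: a correlation (cost) volume between query and support features, small-kernel convolutions applied to that 4D volume to inject local context and convolutional inductive bias, and window-based self-attention for global aggregation. The following is a minimal, illustrative PyTorch sketch of that pipeline, not the authors' implementation (available in the official code linked above); the separable-convolution approximation of the 4D convolution, the window size, and all layer dimensions are assumptions made purely for illustration.

```python
# Illustrative sketch only: correlation volume -> small-kernel (separable) 4D conv
# -> windowed self-attention, loosely following the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F


def correlation_volume(query_feat, support_feat):
    """Cosine-similarity correlations: [B,C,Hq,Wq] x [B,C,Hs,Ws] -> [B,Hq,Wq,Hs,Ws]."""
    q = F.normalize(query_feat.flatten(2), dim=1)    # [B, C, Hq*Wq]
    s = F.normalize(support_feat.flatten(2), dim=1)  # [B, C, Hs*Ws]
    corr = torch.einsum("bcq,bcs->bqs", q, s)
    B, _, Hq, Wq = query_feat.shape
    _, _, Hs, Ws = support_feat.shape
    return corr.view(B, Hq, Wq, Hs, Ws).clamp(min=0)


class Separable4DConv(nn.Module):
    """Small-kernel '4D' convolution approximated by two 2D convolutions:
    one over the query axes (Hq, Wq) and one over the support axes (Hs, Ws)."""
    def __init__(self, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_q = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.conv_s = nn.Conv2d(1, 1, kernel_size, padding=pad)

    def forward(self, corr):  # [B, Hq, Wq, Hs, Ws]
        B, Hq, Wq, Hs, Ws = corr.shape
        x = corr.permute(0, 3, 4, 1, 2).reshape(B * Hs * Ws, 1, Hq, Wq)
        x = self.conv_q(x).reshape(B, Hs, Ws, Hq, Wq)
        x = x.permute(0, 3, 4, 1, 2).reshape(B * Hq * Wq, 1, Hs, Ws)
        return self.conv_s(x).reshape(B, Hq, Wq, Hs, Ws)


class Windowed4DAttentionBlock(nn.Module):
    """Local context via the separable 4D conv, then self-attention inside
    non-overlapping 4D windows of the correlation volume (no window shifting)."""
    def __init__(self, window=4, embed_dim=64, num_heads=4):
        super().__init__()
        self.w = window
        self.local = Separable4DConv()
        self.proj_in = nn.Linear(1, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.proj_out = nn.Linear(embed_dim, 1)

    def forward(self, corr):  # [B, Hq, Wq, Hs, Ws]
        B, Hq, Wq, Hs, Ws = corr.shape
        w = self.w
        x = self.local(corr)  # convolutional inductive bias before tokenization
        # Partition the 4D volume into non-overlapping windows of w**4 tokens.
        x = x.view(B, Hq // w, w, Wq // w, w, Hs // w, w, Ws // w, w)
        x = x.permute(0, 1, 3, 5, 7, 2, 4, 6, 8).reshape(-1, w ** 4, 1)
        t = self.proj_in(x)
        t, _ = self.attn(t, t, t, need_weights=False)
        x = self.proj_out(t)
        # Undo the window partition.
        x = x.view(B, Hq // w, Wq // w, Hs // w, Ws // w, w, w, w, w)
        x = x.permute(0, 1, 5, 2, 6, 3, 7, 4, 8).reshape(B, Hq, Wq, Hs, Ws)
        return x + corr  # residual connection


if __name__ == "__main__":
    q = torch.randn(1, 256, 8, 8)  # toy query features
    s = torch.randn(1, 256, 8, 8)  # toy support features
    corr = correlation_volume(q, s)
    print(Windowed4DAttentionBlock()(corr).shape)  # torch.Size([1, 8, 8, 8, 8])
```

This sketch omits two pieces the abstract mentions: the pyramidal structure, in which aggregation at a coarser level guides aggregation at a finer level, and the decoder that filters the aggregated output with the help of the query's appearance embedding.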

Results

Task | Dataset | Metric | Value | Model
Few-Shot Learning | FSS-1000 (5-shot) | FB-IoU | 94.4 | VAT (ResNet-101)
Few-Shot Learning | FSS-1000 (5-shot) | Mean IoU | 90.8 | VAT (ResNet-101)
Few-Shot Learning | FSS-1000 (5-shot) | FB-IoU | 94.2 | VAT (ResNet-50)
Few-Shot Learning | FSS-1000 (5-shot) | Mean IoU | 90.7 | VAT (ResNet-50)
Few-Shot Learning | COCO-20i (5-shot) | FB-IoU | 72.4 | VAT (ResNet-101)
Few-Shot Learning | COCO-20i (5-shot) | Mean IoU | 47.9 | VAT (ResNet-101)
Few-Shot Learning | FSS-1000 (1-shot) | FB-IoU | 94.0 | VAT (ResNet-101)
Few-Shot Learning | FSS-1000 (1-shot) | Mean IoU | 90.3 | VAT (ResNet-101)
Few-Shot Learning | FSS-1000 (1-shot) | FB-IoU | 93.8 | VAT (ResNet-50)
Few-Shot Learning | FSS-1000 (1-shot) | Mean IoU | 90.1 | VAT (ResNet-50)
Few-Shot Learning | PASCAL-5i (1-shot) | FB-IoU | 79.6 | VAT (ResNet-101)
Few-Shot Learning | PASCAL-5i (1-shot) | Mean IoU | 67.9 | VAT (ResNet-101)
Few-Shot Learning | PASCAL-5i (1-shot) | FB-IoU | 77.8 | VAT (ResNet-50)
Few-Shot Learning | PASCAL-5i (1-shot) | Mean IoU | 65.5 | VAT (ResNet-50)
Few-Shot Learning | COCO-20i (1-shot) | FB-IoU | 68.8 | VAT (ResNet-101)
Few-Shot Learning | COCO-20i (1-shot) | Mean IoU | 41.3 | VAT (ResNet-101)
Few-Shot Learning | PASCAL-5i (5-shot) | FB-IoU | 83.2 | VAT (ResNet-101)
Few-Shot Learning | PASCAL-5i (5-shot) | Mean IoU | 72.0 | VAT (ResNet-101)
Few-Shot Learning | PASCAL-5i (5-shot) | FB-IoU | 80.9 | VAT (ResNet-50)
Few-Shot Learning | PASCAL-5i (5-shot) | Mean IoU | 70.1 | VAT (ResNet-50)
Image Matching | SPair-71k | PCK | 55.5 | VAT (ECCV)
Image Matching | PF-PASCAL | PCK | 92.3 | VAT (ECCV)
Image Matching | PF-WILLOW | PCK | 81.6 | VAT (ECCV)
Few-Shot Semantic Segmentation | FSS-1000 (5-shot) | FB-IoU | 94.4 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | FSS-1000 (5-shot) | Mean IoU | 90.8 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | FSS-1000 (5-shot) | FB-IoU | 94.2 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | FSS-1000 (5-shot) | Mean IoU | 90.7 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | COCO-20i (5-shot) | FB-IoU | 72.4 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | COCO-20i (5-shot) | Mean IoU | 47.9 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | FSS-1000 (1-shot) | FB-IoU | 94.0 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | FSS-1000 (1-shot) | Mean IoU | 90.3 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | FSS-1000 (1-shot) | FB-IoU | 93.8 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | FSS-1000 (1-shot) | Mean IoU | 90.1 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | PASCAL-5i (1-shot) | FB-IoU | 79.6 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | PASCAL-5i (1-shot) | Mean IoU | 67.9 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | PASCAL-5i (1-shot) | FB-IoU | 77.8 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | PASCAL-5i (1-shot) | Mean IoU | 65.5 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | COCO-20i (1-shot) | FB-IoU | 68.8 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | COCO-20i (1-shot) | Mean IoU | 41.3 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | PASCAL-5i (5-shot) | FB-IoU | 83.2 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | PASCAL-5i (5-shot) | Mean IoU | 72.0 | VAT (ResNet-101)
Few-Shot Semantic Segmentation | PASCAL-5i (5-shot) | FB-IoU | 80.9 | VAT (ResNet-50)
Few-Shot Semantic Segmentation | PASCAL-5i (5-shot) | Mean IoU | 70.1 | VAT (ResNet-50)
Meta-Learning | FSS-1000 (5-shot) | FB-IoU | 94.4 | VAT (ResNet-101)
Meta-Learning | FSS-1000 (5-shot) | Mean IoU | 90.8 | VAT (ResNet-101)
Meta-Learning | FSS-1000 (5-shot) | FB-IoU | 94.2 | VAT (ResNet-50)
Meta-Learning | FSS-1000 (5-shot) | Mean IoU | 90.7 | VAT (ResNet-50)
Meta-Learning | COCO-20i (5-shot) | FB-IoU | 72.4 | VAT (ResNet-101)
Meta-Learning | COCO-20i (5-shot) | Mean IoU | 47.9 | VAT (ResNet-101)
Meta-Learning | FSS-1000 (1-shot) | FB-IoU | 94.0 | VAT (ResNet-101)
Meta-Learning | FSS-1000 (1-shot) | Mean IoU | 90.3 | VAT (ResNet-101)
Meta-Learning | FSS-1000 (1-shot) | FB-IoU | 93.8 | VAT (ResNet-50)
Meta-Learning | FSS-1000 (1-shot) | Mean IoU | 90.1 | VAT (ResNet-50)
Meta-Learning | PASCAL-5i (1-shot) | FB-IoU | 79.6 | VAT (ResNet-101)
Meta-Learning | PASCAL-5i (1-shot) | Mean IoU | 67.9 | VAT (ResNet-101)
Meta-Learning | PASCAL-5i (1-shot) | FB-IoU | 77.8 | VAT (ResNet-50)
Meta-Learning | PASCAL-5i (1-shot) | Mean IoU | 65.5 | VAT (ResNet-50)
Meta-Learning | COCO-20i (1-shot) | FB-IoU | 68.8 | VAT (ResNet-101)
Meta-Learning | COCO-20i (1-shot) | Mean IoU | 41.3 | VAT (ResNet-101)
Meta-Learning | PASCAL-5i (5-shot) | FB-IoU | 83.2 | VAT (ResNet-101)
Meta-Learning | PASCAL-5i (5-shot) | Mean IoU | 72.0 | VAT (ResNet-101)
Meta-Learning | PASCAL-5i (5-shot) | FB-IoU | 80.9 | VAT (ResNet-50)
Meta-Learning | PASCAL-5i (5-shot) | Mean IoU | 70.1 | VAT (ResNet-50)
Semantic correspondence | SPair-71k | PCK | 55.5 | VAT (ECCV)
Semantic correspondence | PF-PASCAL | PCK | 92.3 | VAT (ECCV)
Semantic correspondence | PF-WILLOW | PCK | 81.6 | VAT (ECCV)
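
The table reports FB-IoU and mean IoU for the segmentation benchmarks and PCK for the correspondence benchmarks. Below is a minimal sketch of these metrics under their commonly used definitions; the exact evaluation protocol (e.g. the PCK reference size, which varies by benchmark, or how mean IoU is averaged over classes and folds) is not given on this page.

```python
# Sketch of the metrics above, using their common definitions (not the paper's
# evaluation code).
import numpy as np


def iou(pred, gt):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0


def fb_iou(pred, gt):
    """Foreground-background IoU: average of foreground IoU and background IoU.
    Mean IoU, by contrast, averages per-class foreground IoU over test classes."""
    return 0.5 * (iou(pred, gt) + iou(~pred, ~gt))


def pck(pred_kps, gt_kps, ref_size, alpha=0.1):
    """Percentage of Correct Keypoints: a predicted keypoint is correct if it
    lies within alpha * ref_size of the ground truth, where ref_size is the
    benchmark's reference length (image or bounding-box size)."""
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dists <= alpha * ref_size).mean())
```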

Related Papers

RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control (2025-06-15)
Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence (2025-06-09)
Adapter Naturally Serves as Decoupler for Cross-Domain Few-Shot Semantic Segmentation (2025-06-09)
Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels (2025-06-05)
MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation (2025-06-03)
Cora: Correspondence-aware image editing using few step diffusion (2025-05-29)
Semantic Correspondence: Unified Benchmarking and a Strong Baseline (2025-05-23)
DINOv2-powered Few-Shot Semantic Segmentation: A Unified Framework via Cross-Model Distillation and 4D Correlation Mining (2025-04-22)