Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Max Pooling with Vision Transformers reconciles class and shape in weakly supervised semantic segmentation

Simone Rossetti, Damiano Zappia, Marta Sanzari, Marco Schaerf, Fiora Pirri

Published: 2022-10-31

Tasks: Weakly-Supervised Object Segmentation, Weakly-Supervised Semantic Segmentation, Weakly-Supervised Object Detection, Self-Supervised Learning, Semantic Segmentation, Weakly-Supervised Object Localization, Weakly-Supervised Segmentation

Links: Paper · PDF · Code (official)

Abstract

Weakly Supervised Semantic Segmentation (WSSS) research has explored many directions to improve the typical pipeline of CNN plus class activation maps (CAM) plus refinements, given the image-class label as the only supervision. Though the gap with fully supervised methods has narrowed, closing it further seems unlikely within this framework. On the other hand, WSSS methods based on Vision Transformers (ViT) have not yet explored valid alternatives to CAM. ViT features have been shown to retain scene layout and object boundaries in self-supervised learning. To confirm these findings, we prove that the advantages of transformers in self-supervised methods are further strengthened by Global Max Pooling (GMP), which can leverage patch features to negotiate pixel-label probability with class probability. This work proposes a new WSSS method, dubbed ViT-PCM (ViT Patch-Class Mapping), that is not based on CAM. The presented end-to-end network learns, with a single optimization process, refined shape and proper localization for segmentation masks. Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), where we achieve $69.3\%$ mIoU on the PascalVOC 2012 $val$ set. We show that our approach has the smallest set of parameters, while obtaining higher accuracy than all other approaches. In a sentence, quantitative and qualitative results of our method reveal that ViT-PCM is an excellent alternative to CNN-CAM based architectures.
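The abstract's key idea is that Global Max Pooling over per-patch class scores lets an image-level label supervise patch-level (pixel) predictions. A minimal sketch of that mechanism, assuming illustrative shapes and names (this is not the authors' implementation, and the real model uses learned ViT features and a trained classifier):

```python
import math
import random

def patch_class_probs(patch_features, weight):
    """Map each D-dim patch embedding to C per-class probabilities (sigmoid).

    patch_features: N x D list of patch embeddings (illustrative stand-in
    for ViT patch features); weight: C x D classifier rows (assumed shape).
    """
    probs = []
    for feat in patch_features:
        logits = [sum(f * w for f, w in zip(feat, row)) for row in weight]
        probs.append([1.0 / (1.0 + math.exp(-z)) for z in logits])
    return probs  # N x C

def global_max_pool(patch_probs):
    """GMP over patches: image-level class prob = max patch prob per class,
    so an image label can back-propagate to the patches that triggered it."""
    n_classes = len(patch_probs[0])
    return [max(p[c] for p in patch_probs) for c in range(n_classes)]

random.seed(0)
N, D, C = 16, 8, 5  # patches, embedding dim, classes (toy sizes)
feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
W = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]

patch_p = patch_class_probs(feats, W)   # per-patch class probabilities
image_p = global_max_pool(patch_p)      # image-level probabilities for the loss
```

In training, `image_p` would be compared against the image-class labels, which is how the weak supervision reaches individual patches.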

Results

Task                  | Dataset                | Metric   | Value | Model
Semantic Segmentation | COCO 2014 val          | mIoU     | 45    | ViT-PCM
Semantic Segmentation | PASCAL VOC 2012 train  | Mean IoU | 71.4  | ViT-PCM
Semantic Segmentation | PASCAL VOC 2012 val    | Mean IoU | 70.3  | ViT-PCM
Semantic Segmentation | PASCAL VOC 2012 test   | Mean IoU | 70.9  | ViT-PCM
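The table reports mean IoU (mIoU), the standard segmentation metric: per-class intersection over union, averaged over the classes present. A minimal sketch of the computation on flattened label arrays (illustrative only; benchmark evaluations typically accumulate confusion matrices over the whole dataset):

```python
def mean_iou(pred, target, n_classes):
    """Per-class IoU averaged over classes that appear in pred or target.

    pred, target: flat lists of per-pixel class ids of equal length.
    """
    ious = []
    for c in range(n_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy example: class 0 IoU = 1/2, class 1 IoU = 2/3, mIoU = 7/12
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```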

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)