Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


kMaX-DeepLab: k-means Mask Transformer

Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen

2022-07-08 · Panoptic Segmentation · Semantic Segmentation · Clustering · Object Detection

Paper · PDF · Code · Code (official)

Abstract

The rise of transformers in vision tasks not only advances network backbone designs, but also starts a brand-new page to achieve end-to-end image recognition (e.g., object detection and panoptic segmentation). Originating from Natural Language Processing (NLP), transformer architectures, consisting of self-attention and cross-attention, effectively learn long-range interactions between elements in a sequence. However, we observe that most existing transformer-based vision models simply borrow the idea from NLP, neglecting the crucial difference between languages and images, particularly the extremely large sequence length of spatially flattened pixel features. This subsequently impedes the learning in cross-attention between pixel features and object queries. In this paper, we rethink the relationship between pixels and object queries and propose to reformulate the cross-attention learning as a clustering process. Inspired by the traditional k-means clustering algorithm, we develop a k-means Mask Xformer (kMaX-DeepLab) for segmentation tasks, which not only improves the state-of-the-art, but also enjoys a simple and elegant design. As a result, our kMaX-DeepLab achieves a new state-of-the-art performance on COCO val set with 58.0% PQ, Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5% mIoU, and ADE20K val set with 50.9% PQ and 55.2% mIoU, without test-time augmentation or external datasets. We hope our work can shed some light on designing transformers tailored for vision tasks. TensorFlow code and models are available at https://github.com/google-research/deeplab2. A PyTorch re-implementation is also available at https://github.com/bytedance/kmax-deeplab.
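The abstract's central idea, replacing the spatial softmax in cross-attention with a k-means-style cluster assignment over object queries, can be sketched in a few lines. The sketch below is illustrative only: function names, the hard one-hot assignment, and the residual update are assumptions for exposition, not the exact kMaX-DeepLab operator.

```python
import numpy as np

def kmeans_cross_attention(queries, pixel_features):
    """One k-means-style cross-attention update (illustrative sketch).

    queries:        (N, D) object queries, treated as cluster centers
    pixel_features: (HW, D) spatially flattened pixel features
    """
    # Affinity between every pixel and every query (cluster center).
    logits = pixel_features @ queries.T                     # (HW, N)

    # k-means assignment step: each pixel is hard-assigned to its
    # best query (argmax over queries), instead of the usual softmax
    # over the huge spatial dimension.
    assign = np.zeros_like(logits)
    assign[np.arange(logits.shape[0]), logits.argmax(axis=1)] = 1.0

    # k-means update step: each query moves toward the mean of the
    # pixel features assigned to it.
    counts = assign.sum(axis=0, keepdims=True).T            # (N, 1)
    cluster_means = (assign.T @ pixel_features) / np.maximum(counts, 1.0)

    # Residual update (an assumption here); queries with no assigned
    # pixels are left unchanged because their cluster mean is zero.
    return queries + cluster_means
```

One update therefore mirrors a single k-means iteration: assign pixels to centers, then re-estimate the centers, with the object queries playing the role of centers.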

Results

Task | Dataset | Metric | Value | Model
Panoptic Segmentation | Cityscapes test | PQ | 66.2 | kMaX-DeepLab (single-scale)
Panoptic Segmentation | Cityscapes val | AP | 44.0 | kMaX-DeepLab (single-scale)
Panoptic Segmentation | Cityscapes val | PQ | 68.4 | kMaX-DeepLab (single-scale)
Panoptic Segmentation | Cityscapes val | mIoU | 83.5 | kMaX-DeepLab (single-scale)
Panoptic Segmentation | COCO test-dev | PQ | 58.5 | kMaX-DeepLab (single-scale)
Panoptic Segmentation | COCO test-dev | PQst | 49.0 | kMaX-DeepLab (single-scale)
Panoptic Segmentation | COCO test-dev | PQth | 64.8 | kMaX-DeepLab (single-scale)
Panoptic Segmentation | ADE20K val | PQ | 50.9 | kMaX-DeepLab (ConvNeXt-L, single-scale, 1281x1281)
Panoptic Segmentation | ADE20K val | mIoU | 55.2 | kMaX-DeepLab (ConvNeXt-L, single-scale, 1281x1281)
Panoptic Segmentation | ADE20K val | PQ | 48.7 | kMaX-DeepLab (ConvNeXt-L, single-scale, 641x641)
Panoptic Segmentation | ADE20K val | mIoU | 54.8 | kMaX-DeepLab (ConvNeXt-L, single-scale, 641x641)
Panoptic Segmentation | ADE20K val | PQ | 42.3 | kMaX-DeepLab (ResNet50, single-scale, 1281x1281)
Panoptic Segmentation | ADE20K val | mIoU | 45.3 | kMaX-DeepLab (ResNet50, single-scale, 1281x1281)
Panoptic Segmentation | ADE20K val | PQ | 41.5 | kMaX-DeepLab (ResNet50, single-scale, 641x641)
Panoptic Segmentation | ADE20K val | mIoU | 45.0 | kMaX-DeepLab (ResNet50, single-scale, 641x641)
Panoptic Segmentation | COCO minival | PQ | 58.1 | kMaX-DeepLab (single-scale, pseudo-labels)
Panoptic Segmentation | COCO minival | PQst | 48.8 | kMaX-DeepLab (single-scale, pseudo-labels)
Panoptic Segmentation | COCO minival | PQth | 64.3 | kMaX-DeepLab (single-scale, pseudo-labels)
Panoptic Segmentation | COCO minival | PQ | 58.0 | kMaX-DeepLab (single-scale, drop query with 256 queries)
Panoptic Segmentation | COCO minival | PQst | 48.6 | kMaX-DeepLab (single-scale, drop query with 256 queries)
Panoptic Segmentation | COCO minival | PQth | 64.2 | kMaX-DeepLab (single-scale, drop query with 256 queries)
Panoptic Segmentation | COCO minival | PQ | 57.9 | kMaX-DeepLab (single-scale)
Panoptic Segmentation | COCO minival | PQst | 48.6 | kMaX-DeepLab (single-scale)
Panoptic Segmentation | COCO minival | PQth | 64.0 | kMaX-DeepLab (single-scale)
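The PQ (Panoptic Quality) numbers in the table follow the standard definition by Kirillov et al.: the sum of IoUs over matched segment pairs divided by TP + 0.5·FP + 0.5·FN, where a match requires IoU > 0.5 (PQth and PQst restrict this to "thing" and "stuff" classes, respectively). A minimal sketch of the formula, with an illustrative function name:

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Panoptic Quality: sum(IoU of matched pairs) / (TP + 0.5*FP + 0.5*FN).

    matched_ious: IoU values of matched (prediction, ground-truth) segment
                  pairs; each must exceed 0.5 to count as a true positive.
    num_fp:       unmatched predicted segments (false positives)
    num_fn:       unmatched ground-truth segments (false negatives)
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0
```

For example, two matches with IoUs 0.9 and 0.8 plus one false positive and one false negative give PQ = 1.7 / 3 ≈ 0.567.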

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Tri-Learn Graph Fusion Network for Attributed Graph Clustering (2025-07-18)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)