Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


PolyMaX: General Dense Prediction with Mask Transformer

Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen

Published: 2023-11-09 | Tasks: Surface Normal Estimation · Semantic Segmentation · Depth Estimation · Monocular Depth Estimation
Paper · PDF · Code (official)

Abstract

Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, on the recent frontier of the segmentation task, the community has been witnessing a paradigm shift from per-pixel prediction to cluster prediction with the emergence of transformer architectures, particularly mask transformers, which directly predict a label for a mask instead of a pixel. Despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks for other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction. Motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose to generalize the cluster-prediction approach to general dense prediction tasks. This allows us to unify dense prediction tasks within the mask transformer framework. Remarkably, the resulting model, PolyMaX, demonstrates state-of-the-art performance on three benchmarks of the NYUD-v2 dataset. We hope our simple yet effective design can inspire more research on exploiting mask transformers for dense prediction tasks. Code and model will be made available.
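The cluster-prediction idea described in the abstract can be sketched in a few lines: rather than regressing one continuous value per pixel, a model predicts K mask logits per pixel plus one continuous value (e.g. a depth bin) per mask, and the dense output is the probability-weighted combination. This is a minimal illustrative sketch under assumed shapes and names (`cluster_prediction`, `mask_logits`, `cluster_values` are hypothetical), not the authors' implementation.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cluster_prediction(mask_logits, cluster_values):
    """mask_logits: (K, H, W) per-pixel cluster (mask) logits.
    cluster_values: (K,) one continuous value per cluster, e.g. a depth bin.
    Returns an (H, W) dense prediction as the soft-assignment-weighted sum."""
    probs = softmax(mask_logits, axis=0)           # per-pixel cluster probabilities
    return np.tensordot(cluster_values, probs, 1)  # weighted sum over the K clusters

# Toy example: 8 clusters with depth bins spanning 0.5 m to 10 m.
rng = np.random.default_rng(0)
K, H, W = 8, 4, 4
pred = cluster_prediction(rng.normal(size=(K, H, W)),
                          np.linspace(0.5, 10.0, K))
```

Because the output is a convex combination of the per-cluster values, every pixel's prediction stays within the range spanned by the cluster values, which is one appeal of discretizing the continuous output space.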

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.969 | PolyMaX (ConvNeXt-L) |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.9958 | PolyMaX (ConvNeXt-L) |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 0.999 | PolyMaX (ConvNeXt-L) |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.25 | PolyMaX (ConvNeXt-L) |
| Depth Estimation | NYU-Depth V2 | Absolute relative error | 0.067 | PolyMaX (ConvNeXt-L) |
| Depth Estimation | NYU-Depth V2 | log10 | 0.029 | PolyMaX (ConvNeXt-L) |
| 3D | NYU-Depth V2 | Delta < 1.25 | 0.969 | PolyMaX (ConvNeXt-L) |
| 3D | NYU-Depth V2 | Delta < 1.25^2 | 0.9958 | PolyMaX (ConvNeXt-L) |
| 3D | NYU-Depth V2 | Delta < 1.25^3 | 0.999 | PolyMaX (ConvNeXt-L) |
| 3D | NYU-Depth V2 | RMSE | 0.25 | PolyMaX (ConvNeXt-L) |
| 3D | NYU-Depth V2 | Absolute relative error | 0.067 | PolyMaX (ConvNeXt-L) |
| 3D | NYU-Depth V2 | log10 | 0.029 | PolyMaX (ConvNeXt-L) |
| Surface Normal Estimation | NYU Depth v2 | % < 11.25° | 65.66 | PolyMaX (ConvNeXt-L) |
| Surface Normal Estimation | NYU Depth v2 | % < 22.5° | 82.28 | PolyMaX (ConvNeXt-L) |
| Surface Normal Estimation | NYU Depth v2 | % < 30° | 87.83 | PolyMaX (ConvNeXt-L) |
| Surface Normal Estimation | NYU Depth v2 | Mean angle error | 13.09 | PolyMaX (ConvNeXt-L) |
| Surface Normal Estimation | NYU Depth v2 | RMSE | 20.4 | PolyMaX (ConvNeXt-L) |
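The depth metrics reported above follow standard definitions: the delta accuracies count pixels whose prediction-to-ground-truth ratio (in either direction) falls under a threshold, alongside RMSE, absolute relative error, and mean absolute log10 error. The sketch below is a generic implementation of those standard formulas, not the authors' evaluation code; `depth_metrics` and its inputs are illustrative names.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth metrics over flat arrays of positive depths."""
    ratio = np.maximum(pred / gt, gt / pred)  # symmetric ratio per pixel
    return {
        "delta<1.25":   np.mean(ratio < 1.25),
        "delta<1.25^2": np.mean(ratio < 1.25 ** 2),
        "delta<1.25^3": np.mean(ratio < 1.25 ** 3),
        "rmse":         np.sqrt(np.mean((pred - gt) ** 2)),
        "abs_rel":      np.mean(np.abs(pred - gt) / gt),
        "log10":        np.mean(np.abs(np.log10(pred) - np.log10(gt))),
    }

# Tiny worked example (depths in meters).
gt = np.array([1.0, 2.0, 4.0])
pred = np.array([1.1, 2.0, 5.5])
m = depth_metrics(pred, gt)
```

In this toy case the ratios are [1.1, 1.0, 1.375], so two of three pixels pass the 1.25 threshold while all three pass 1.25^2, mirroring how the higher-order delta metrics in the table approach 1.0.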

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction (2025-07-21)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
- A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
- $S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)