
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vision Transformers for Dense Prediction

René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

Published: 2021-03-24 | Venue: ICCV 2021 10 | Tasks: Semantic Segmentation, Prediction, Depth Estimation, Monocular Depth Estimation

Abstract

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
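The step of assembling tokens into image-like representations can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes a square patch size, a leading non-spatial readout token that is simply dropped, and nearest-neighbour upsampling as a stand-in for a learned resampling step; the function names `tokens_to_grid` and `upsample_nearest` are hypothetical.

```python
from typing import List

Token = List[float]          # one embedding vector
Grid = List[List[Token]]     # rows x cols x dim, an "image-like" layout


def tokens_to_grid(tokens: List[Token], image_h: int, image_w: int,
                   patch: int, has_readout: bool = True) -> Grid:
    """Arrange a flat token sequence into a (rows, cols, dim) grid.

    Assumption: tokens are ordered row by row over the patch grid, and the
    first token (if has_readout) carries no spatial position and is dropped.
    """
    if image_h % patch or image_w % patch:
        raise ValueError("image size must be divisible by the patch size")
    rows, cols = image_h // patch, image_w // patch
    patch_tokens = tokens[1:] if has_readout else tokens
    if len(patch_tokens) != rows * cols:
        raise ValueError("token count does not match the patch grid")
    return [patch_tokens[r * cols:(r + 1) * cols] for r in range(rows)]


def upsample_nearest(grid: Grid, factor: int) -> Grid:
    """Repeat every grid cell factor x factor times (nearest neighbour)."""
    out: Grid = []
    for row in grid:
        wide = [tok for tok in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Toy input: a 32x48 image with 16x16 patches gives a 2x3 grid,
# so the sequence holds 1 readout token + 6 patch tokens of dimension 2.
tokens = [[-1.0, -1.0]] + [[float(i), float(i) * 10] for i in range(6)]
grid = tokens_to_grid(tokens, image_h=32, image_w=48, patch=16)
finer = upsample_nearest(grid, factor=2)   # 4x6 grid at twice the resolution
```

Applying the same reshaping to tokens taken from several transformer stages, each resampled by a different factor, would give the multi-resolution feature maps that a convolutional decoder can then fuse.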

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.904 | DPT-Hybrid |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.988 | DPT-Hybrid |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 0.994 | DPT-Hybrid |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.357 | DPT-Hybrid |
| Depth Estimation | NYU-Depth V2 | absolute relative error | 0.11 | DPT-Hybrid |
| Depth Estimation | NYU-Depth V2 | log 10 | 0.045 | DPT-Hybrid |
| Depth Estimation | ETH3D | Delta < 1.25 | 0.0946 | DPT |
| Depth Estimation | ETH3D | absolute relative error | 0.078 | DPT |
| Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.959 | DPT-Hybrid |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.995 | DPT-Hybrid |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 0.999 | DPT-Hybrid |
| Depth Estimation | KITTI Eigen split | RMSE | 2.573 | DPT-Hybrid |
| Depth Estimation | KITTI Eigen split | RMSE log | 0.092 | DPT-Hybrid |
| Depth Estimation | KITTI Eigen split | absolute relative error | 0.062 | DPT-Hybrid |
| Semantic Segmentation | ADE20K val | Pixel Accuracy | 83.11 | DPT-Hybrid |
| Semantic Segmentation | ADE20K val | mIoU | 49.02 | DPT-Hybrid |
| Semantic Segmentation | PASCAL Context | mIoU | 60.46 | DPT-Hybrid |
| Semantic Segmentation | ADE20K | Validation mIoU | 49.02 | DPT-Hybrid |
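The page does not define the depth metrics it reports. The sketch below uses the definitions conventional in the monocular depth literature (threshold accuracy, absolute relative error, RMSE, RMSE of log depth, and mean log10 error), so treat the exact formulas as an assumption about how these numbers were computed. The function name `depth_metrics` and the rule of skipping non-positive depths are illustrative choices.

```python
import math
from typing import Dict, Sequence


def depth_metrics(pred: Sequence[float], gt: Sequence[float]) -> Dict[str, float]:
    """Compute common depth-estimation metrics over paired per-pixel depths.

    Assumption: pixels with non-positive predicted or ground-truth depth are
    treated as invalid and skipped (logs and ratios are undefined there).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    n = len(pairs)

    # Threshold accuracy: fraction of pixels with max(p/g, g/p) < 1.25^k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    metrics = {
        f"delta<1.25^{k}": sum(r < 1.25 ** k for r in ratios) / n
        for k in (1, 2, 3)
    }

    # Absolute relative error: mean of |p - g| / g.
    metrics["abs_rel"] = sum(abs(p - g) / g for p, g in pairs) / n
    # Root mean squared error in depth units.
    metrics["rmse"] = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # RMSE computed on natural-log depths.
    metrics["rmse_log"] = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    # Mean absolute error between base-10 log depths.
    metrics["log10"] = sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n
    return metrics


# Toy example: two exact predictions and one that is off by a factor of 2.
example = depth_metrics(pred=[1.0, 2.0, 4.0], gt=[1.0, 2.0, 2.0])
```

Higher is better for the Delta thresholds, and lower is better for the error metrics.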

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)