Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Unleashing Text-to-Image Diffusion Models for Visual Perception

Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu

2023-03-03 · ICCV 2023
Tasks: Denoising, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Depth Estimation, Monocular Depth Estimation, Image Segmentation
Links: Paper · PDF · Code · Code (official)

Abstract

Diffusion models (DMs) have become the new trend of generative models and have demonstrated a powerful ability of conditional synthesis. Among those, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable by customizable prompts. Unlike the unconditional generative models that focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to the vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and aim to study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to a better alignment with the pre-training stage and making the visual contents interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD
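The abstract's "cross-attention maps between the visual features and the text features" can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the shapes, the softmax axis, and the concatenation as guidance are illustrative assumptions about how per-token attention maps could be extracted and attached to a feature map.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_maps(visual_feats, text_feats):
    """Scaled dot-product cross-attention between flattened visual
    features [H*W, d] (queries) and text features [K, d] (keys).
    Returns one spatial map per text token, shape [K, H*W]."""
    d = visual_feats.shape[-1]
    scores = visual_feats @ text_feats.T / np.sqrt(d)  # [H*W, K]
    attn = softmax(scores, axis=-1)  # each position distributes over tokens
    return attn.T  # [K, H*W]: one map per text token

# Toy shapes (hypothetical, not the paper's actual resolutions)
H, W, d, K = 8, 8, 16, 4
rng = np.random.default_rng(0)
visual = rng.standard_normal((H * W, d))
text = rng.standard_normal((K, d))

maps = cross_attention_maps(visual, text)            # [K, H*W]
# "Explicit guidance": concatenate the maps onto the visual features
guidance = np.concatenate([visual, maps.T], axis=1)  # [H*W, d + K]
print(maps.shape, guidance.shape)  # (4, 64) (64, 20)
```

Each of the K maps can be reshaped to H×W and read as a rough localization of one text token, which is what makes them useful as guidance for a perception head.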

Results

Task | Dataset | Metric | Value | Model
Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.964 | VPD
Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.995 | VPD
Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 0.999 | VPD
Depth Estimation | NYU-Depth V2 | RMSE | 0.254 | VPD
Depth Estimation | NYU-Depth V2 | absolute relative error | 0.069 | VPD
Depth Estimation | NYU-Depth V2 | log 10 | 0.03 | VPD
3D | NYU-Depth V2 | Delta < 1.25 | 0.964 | VPD
3D | NYU-Depth V2 | Delta < 1.25^2 | 0.995 | VPD
3D | NYU-Depth V2 | Delta < 1.25^3 | 0.999 | VPD
3D | NYU-Depth V2 | RMSE | 0.254 | VPD
3D | NYU-Depth V2 | absolute relative error | 0.069 | VPD
3D | NYU-Depth V2 | log 10 | 0.03 | VPD
Instance Segmentation | RefCOCO val | Overall IoU | 73.25 | VPD
Referring Expression Segmentation | RefCOCO val | Overall IoU | 73.25 | VPD
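The depth metrics in the table (Delta thresholds, RMSE, absolute relative error, log 10) follow the standard monocular-depth definitions used on NYU-Depth V2. A small numpy sketch computing them on toy data (not NYUv2; the arrays are synthetic):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid pixels.
    Delta_i = fraction of pixels with max(pred/gt, gt/pred) < 1.25^i."""
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
        "rmse": np.sqrt(np.mean((pred - gt) ** 2)),
        "abs_rel": np.mean(np.abs(pred - gt) / gt),
        "log10": np.mean(np.abs(np.log10(pred) - np.log10(gt))),
    }

rng = np.random.default_rng(0)
gt = rng.uniform(0.5, 10.0, size=1000)        # toy ground-truth depths (meters)
pred = gt * rng.uniform(0.9, 1.1, size=1000)  # toy predictions within +/-10%
m = depth_metrics(pred, gt)
print({k: round(float(v), 3) for k, v in m.items()})
```

Since every toy prediction is within 10% of ground truth, all ratios fall below 1.25 and the Delta metrics saturate at 1.0; real benchmark numbers like the 0.964 / 0.995 / 0.999 above come from the same formulas applied to actual predictions.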

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)