
Harnessing Diffusion Models for Visual Perception with Meta Prompts

Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang

2023-12-22 · Semantic Segmentation · Pose Estimation · Depth Estimation · Monocular Depth Estimation

Paper · PDF · Code (official)

Abstract

The issue of generative pretraining for vision models has persisted as a long-standing conundrum. At present, the text-to-image (T2I) diffusion model demonstrates remarkable proficiency in generating high-definition images matching textual inputs, a feat made possible through its pre-training on large-scale image-text pairs. This leads to a natural inquiry: can diffusion models be utilized to tackle visual perception tasks? In this paper, we propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. Our key insight is to introduce learnable embeddings (meta prompts) into the pre-trained diffusion model to extract proper features for perception. The effect of meta prompts is two-fold. First, as a direct replacement for the text embeddings in the T2I model, they activate task-relevant features during feature extraction. Second, they are used to re-arrange the extracted features, ensuring that the model focuses on the features most pertinent to the task at hand. Additionally, we design a recurrent refinement training strategy that fully leverages the properties of diffusion models, thereby yielding stronger visual features. Extensive experiments across various benchmarks validate the effectiveness of our approach. Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes. Concurrently, the proposed method attains results comparable to the current state of the art in semantic segmentation on ADE20K and pose estimation on COCO, further exemplifying its robustness and versatility.
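To make the two roles of the meta prompts concrete, below is a minimal, self-contained PyTorch sketch of the scheme the abstract describes. It is not the authors' implementation: the FrozenDiffusionBackbone stand-in, the module names, the prompt count, the feature dimensions, and the way earlier features are fed back during refinement are all illustrative assumptions; a faithful version would wrap a pre-trained Stable Diffusion UNet and its cross-attention layers instead of the toy encoder below.

import torch
import torch.nn as nn


class FrozenDiffusionBackbone(nn.Module):
    # Stand-in for a pre-trained T2I UNet: it consumes an image and a
    # conditioning sequence (normally CLIP text embeddings) and returns a
    # feature map. Purely illustrative; not Stable Diffusion.
    def __init__(self, feat_dim=256):
        super().__init__()
        self.encode = nn.Conv2d(3, feat_dim, kernel_size=4, stride=4)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, image, cond, prev=None):
        feats = self.encode(image)                       # (B, C, H/4, W/4)
        if prev is not None:
            feats = feats + prev                         # feed earlier features back in
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)        # (B, HW, C)
        tokens, _ = self.cross_attn(tokens, cond, cond)  # condition on the prompts
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class MetaPromptPerception(nn.Module):
    def __init__(self, num_prompts=64, feat_dim=256, num_classes=19, steps=3):
        super().__init__()
        # Role 1: learnable meta prompts replace the text embeddings of the T2I model.
        self.meta_prompts = nn.Parameter(torch.randn(num_prompts, feat_dim) * 0.02)
        self.backbone = FrozenDiffusionBackbone(feat_dim)
        self.steps = steps                               # recurrent refinement steps
        self.head = nn.Conv2d(num_prompts, num_classes, kernel_size=1)

    def forward(self, image):
        b = image.size(0)
        cond = self.meta_prompts.unsqueeze(0).expand(b, -1, -1)   # (B, N, C)
        feats = None
        for _ in range(self.steps):                      # recurrent refinement
            feats = self.backbone(image, cond, prev=feats)
        # Role 2: re-arrange features by their affinity to each prompt, so the
        # task head sees prompt-indexed maps rather than raw backbone channels.
        b, c, h, w = feats.shape
        affinity = torch.einsum("nc,bcl->bnl", self.meta_prompts, feats.flatten(2))
        rearranged = affinity.softmax(dim=1).reshape(b, -1, h, w)  # (B, N, H/4, W/4)
        return self.head(rearranged)                     # e.g. segmentation logits


logits = MetaPromptPerception()(torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 19, 16, 16])

In this reading, the meta prompts act both as the conditioning signal (in place of text) and as a learned query bank for reorganizing the extracted features; the refinement loop reuses the same frozen backbone several times, in the spirit of iterative denoising.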

Results

Task                  | Dataset                          | Metric                  | Value | Model
Depth Estimation      | NYU-Depth V2                     | Delta < 1.25            | 0.976 | MetaPrompt-SD
Depth Estimation      | NYU-Depth V2                     | Delta < 1.25^2          | 0.997 | MetaPrompt-SD
Depth Estimation      | NYU-Depth V2                     | Delta < 1.25^3          | 0.999 | MetaPrompt-SD
Depth Estimation      | NYU-Depth V2                     | RMSE                    | 0.223 | MetaPrompt-SD
Depth Estimation      | NYU-Depth V2                     | absolute relative error | 0.061 | MetaPrompt-SD
Depth Estimation      | NYU-Depth V2                     | log 10                  | 0.027 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | Delta < 1.25            | 0.981 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | Delta < 1.25^2          | 0.998 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | Delta < 1.25^3          | 1     | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | RMSE                    | 1.928 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | RMSE log                | 0.071 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | Sq Rel                  | 0.125 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | absolute relative error | 0.047 | MetaPrompt-SD
Semantic Segmentation | Cityscapes test                  | Mean IoU (class)        | 86.2  | MetaPrompt-SD
Semantic Segmentation | Cityscapes val                   | mIoU                    | 87.1  | MetaPrompt-SD
Semantic Segmentation | ADE20K                           | Validation mIoU         | 56.8  | MetaPrompt-SD
Pose Estimation       | COCO (Common Objects in Context) | AP                      | 79    | MetaPrompt-SD
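For reference, the depth-estimation rows above follow the usual Eigen-style evaluation protocol. The snippet below gives the standard definitions of those metrics as commonly used on NYU-Depth V2 and KITTI; it assumes pred and gt are aligned, strictly positive depth maps with invalid pixels already masked out, and is provided only to clarify the metric names in the table.

import numpy as np


def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    # Standard monocular depth-estimation metrics computed over valid pixels.
    ratio = np.maximum(pred / gt, gt / pred)
    err = pred - gt
    log_err = np.log(pred) - np.log(gt)
    return {
        "abs_rel": float(np.mean(np.abs(err) / gt)),          # absolute relative error
        "sq_rel": float(np.mean(err ** 2 / gt)),               # Sq Rel
        "rmse": float(np.sqrt(np.mean(err ** 2))),             # RMSE
        "rmse_log": float(np.sqrt(np.mean(log_err ** 2))),     # RMSE log
        "log10": float(np.mean(np.abs(np.log10(pred / gt)))),  # log 10
        "delta_1": float(np.mean(ratio < 1.25)),               # Delta < 1.25
        "delta_2": float(np.mean(ratio < 1.25 ** 2)),          # Delta < 1.25^2
        "delta_3": float(np.mean(ratio < 1.25 ** 3)),          # Delta < 1.25^3
    }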

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)