
Harnessing Diffusion Models for Visual Perception with Meta Prompts

Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang

2023-12-22 · Semantic Segmentation · Pose Estimation · Depth Estimation · Monocular Depth Estimation

Paper · PDF · Code (official)

Abstract

The issue of generative pretraining for vision models has persisted as a long-standing conundrum. At present, the text-to-image (T2I) diffusion model demonstrates remarkable proficiency in generating high-definition images matching textual inputs, a feat made possible through its pre-training on large-scale image-text pairs. This leads to a natural inquiry: can diffusion models be utilized to tackle visual perception tasks? In this paper, we propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. Our key insight is to introduce learnable embeddings (meta prompts) into the pre-trained diffusion model to extract proper features for perception. The effect of meta prompts is two-fold. First, as a direct replacement for the text embeddings in the T2I model, they activate task-relevant features during feature extraction. Second, they are used to re-arrange the extracted features, ensuring that the model focuses on the features most pertinent to the task at hand. Additionally, we design a recurrent refinement training strategy that fully leverages the properties of diffusion models, thereby yielding stronger visual features. Extensive experiments across various benchmarks validate the effectiveness of our approach. Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes. Concurrently, the proposed method attains results comparable to the current state of the art in semantic segmentation on ADE20K and pose estimation on COCO, further exemplifying its robustness and versatility.
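To make the two roles of the meta prompts concrete, below is a minimal, self-contained PyTorch sketch of the scheme the abstract describes. It is not the authors' implementation: the FrozenDiffusionBackbone stand-in, the module names, the prompt count, the feature dimensions, and the way earlier features are fed back during refinement are all illustrative assumptions; a faithful version would wrap a pre-trained Stable Diffusion UNet and its cross-attention layers instead of the toy encoder below.

import torch
import torch.nn as nn


class FrozenDiffusionBackbone(nn.Module):
    # Stand-in for a pre-trained T2I UNet: it consumes an image and a
    # conditioning sequence (normally CLIP text embeddings) and returns a
    # feature map. Purely illustrative; not Stable Diffusion.
    def __init__(self, feat_dim=256):
        super().__init__()
        self.encode = nn.Conv2d(3, feat_dim, kernel_size=4, stride=4)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, image, cond, prev=None):
        feats = self.encode(image)                       # (B, C, H/4, W/4)
        if prev is not None:
            feats = feats + prev                         # feed earlier features back in
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)        # (B, HW, C)
        tokens, _ = self.cross_attn(tokens, cond, cond)  # condition on the prompts
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class MetaPromptPerception(nn.Module):
    def __init__(self, num_prompts=64, feat_dim=256, num_classes=19, steps=3):
        super().__init__()
        # Role 1: learnable meta prompts replace the text embeddings of the T2I model.
        self.meta_prompts = nn.Parameter(torch.randn(num_prompts, feat_dim) * 0.02)
        self.backbone = FrozenDiffusionBackbone(feat_dim)
        self.steps = steps                               # recurrent refinement steps
        self.head = nn.Conv2d(num_prompts, num_classes, kernel_size=1)

    def forward(self, image):
        b = image.size(0)
        cond = self.meta_prompts.unsqueeze(0).expand(b, -1, -1)   # (B, N, C)
        feats = None
        for _ in range(self.steps):                      # recurrent refinement
            feats = self.backbone(image, cond, prev=feats)
        # Role 2: re-arrange features by their affinity to each prompt, so the
        # task head sees prompt-indexed maps rather than raw backbone channels.
        b, c, h, w = feats.shape
        affinity = torch.einsum("nc,bcl->bnl", self.meta_prompts, feats.flatten(2))
        rearranged = affinity.softmax(dim=1).reshape(b, -1, h, w)  # (B, N, H/4, W/4)
        return self.head(rearranged)                     # e.g. segmentation logits


logits = MetaPromptPerception()(torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 19, 16, 16])

In this reading, the meta prompts act both as the conditioning signal (in place of text) and as a learned query bank for reorganizing the extracted features; the refinement loop reuses the same frozen backbone several times, in the spirit of iterative denoising.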

Results

Task                  | Dataset                          | Metric                  | Value | Model
Depth Estimation      | NYU-Depth V2                     | Delta < 1.25            | 0.976 | MetaPrompt-SD
Depth Estimation      | NYU-Depth V2                     | Delta < 1.25^2          | 0.997 | MetaPrompt-SD
Depth Estimation      | NYU-Depth V2                     | Delta < 1.25^3          | 0.999 | MetaPrompt-SD
Depth Estimation      | NYU-Depth V2                     | RMSE                    | 0.223 | MetaPrompt-SD
Depth Estimation      | NYU-Depth V2                     | absolute relative error | 0.061 | MetaPrompt-SD
Depth Estimation      | NYU-Depth V2                     | log 10                  | 0.027 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | Delta < 1.25            | 0.981 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | Delta < 1.25^2          | 0.998 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | Delta < 1.25^3          | 1     | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | RMSE                    | 1.928 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | RMSE log                | 0.071 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | Sq Rel                  | 0.125 | MetaPrompt-SD
Depth Estimation      | KITTI Eigen split                | absolute relative error | 0.047 | MetaPrompt-SD
Semantic Segmentation | Cityscapes test                  | Mean IoU (class)        | 86.2  | MetaPrompt-SD
Semantic Segmentation | Cityscapes val                   | mIoU                    | 87.1  | MetaPrompt-SD
Semantic Segmentation | ADE20K                           | Validation mIoU         | 56.8  | MetaPrompt-SD
Pose Estimation       | COCO (Common Objects in Context) | AP                      | 79    | MetaPrompt-SD
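For reference, the depth-estimation rows above follow the usual Eigen-style evaluation protocol. The snippet below gives the standard definitions of those metrics as commonly used on NYU-Depth V2 and KITTI; it assumes pred and gt are aligned, strictly positive depth maps with invalid pixels already masked out, and is provided only to clarify the metric names in the table.

import numpy as np


def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    # Standard monocular depth-estimation metrics computed over valid pixels.
    ratio = np.maximum(pred / gt, gt / pred)
    err = pred - gt
    log_err = np.log(pred) - np.log(gt)
    return {
        "abs_rel": float(np.mean(np.abs(err) / gt)),          # absolute relative error
        "sq_rel": float(np.mean(err ** 2 / gt)),               # Sq Rel
        "rmse": float(np.sqrt(np.mean(err ** 2))),             # RMSE
        "rmse_log": float(np.sqrt(np.mean(log_err ** 2))),     # RMSE log
        "log10": float(np.mean(np.abs(np.log10(pred / gt)))),  # log 10
        "delta_1": float(np.mean(ratio < 1.25)),               # Delta < 1.25
        "delta_2": float(np.mean(ratio < 1.25 ** 2)),          # Delta < 1.25^2
        "delta_3": float(np.mean(ratio < 1.25 ** 3)),          # Delta < 1.25^3
    }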

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)