Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe

2024-09-17 · Surface Normal Estimation · Depth Estimation · Image Generation · Conditional Image Generation · Monocular Depth Estimation

Paper · PDF · Code (official)

Abstract

Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200$\times$ faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
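The "flaw in the inference pipeline" the abstract refers to is a timestep mismatch at low step counts: with the common "leading" DDIM timestep spacing, a single-step run denoises at a timestep near t=0 even though the input latent is pure noise (which the model was trained to expect at t≈T). Switching to "trailing" spacing aligns the single step with the terminal timestep. As a rough illustration only (not the authors' code), the sketch below mimics the two spacing rules, loosely modeled on the behavior of diffusers' DDIMScheduler:

```python
import numpy as np

def ddim_timesteps(num_inference_steps, num_train_timesteps=1000, spacing="leading"):
    """Illustrative helper mimicking two common DDIM timestep-spacing rules.

    'leading'  : steps anchored near t=0 -- a single step lands at t=0,
                 mismatched with a pure-noise input latent.
    'trailing' : steps anchored at t=T-1 -- a single step lands at t=999,
                 matching what the model saw for pure noise during training.
    """
    step_ratio = num_train_timesteps // num_inference_steps
    if spacing == "leading":
        # e.g. 1 step -> [0]; 4 steps -> [750, 500, 250, 0]
        return (np.arange(0, num_inference_steps) * step_ratio)[::-1].astype(int)
    if spacing == "trailing":
        # e.g. 1 step -> [999]; 4 steps -> [999, 749, 499, 249]
        return (np.round(np.arange(num_train_timesteps, 0, -step_ratio)) - 1).astype(int)
    raise ValueError(f"unknown spacing: {spacing}")
```

Under this sketch, single-step "leading" inference asks the model to denoise pure noise at t=0, while "trailing" correctly presents it at t=999; the details of the production scheduler (e.g. step offsets) may differ.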

Results

Task | Dataset | Metric | Value | Model
Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.966 | Marigold + E2E FT (zero-shot)
Depth Estimation | NYU-Depth V2 | Absolute relative error | 0.052 | Marigold + E2E FT (zero-shot)
3D | NYU-Depth V2 | Delta < 1.25 | 0.966 | Marigold + E2E FT (zero-shot)
3D | NYU-Depth V2 | Absolute relative error | 0.052 | Marigold + E2E FT (zero-shot)
Surface Normals Estimation | iBims-1 | % < 11.25 | 69.9 | Marigold + E2E FT (zero-shot)
Surface Normals Estimation | iBims-1 | Mean Angle Error | 15.8 | Marigold + E2E FT (zero-shot)
Surface Normals Estimation | NYU-Depth V2 | % < 11.25 | 61.4 | Marigold + E2E FT (zero-shot)
Surface Normals Estimation | NYU-Depth V2 | Mean Angle Error | 16.2 | Marigold + E2E FT (zero-shot)
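The two depth metrics in the table are standard: absolute relative error (AbsRel) averages |pred − gt| / gt over valid pixels, and Delta < 1.25 is the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below 1.25. A minimal sketch of both (function name is mine, not from the paper):

```python
import numpy as np

def depth_metrics(pred, gt, mask=None):
    """Compute AbsRel and the Delta < 1.25 accuracy over valid pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if mask is None:
        mask = gt > 0  # ignore pixels with no ground-truth depth
    p, g = pred[mask], gt[mask]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)
    delta1 = float(np.mean(ratio < 1.25))
    return abs_rel, delta1
```

For example, a prediction uniformly 30% too deep gives AbsRel = 0.3 and Delta < 1.25 of 0, since every per-pixel ratio is 1.3.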

Related Papers

$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing Constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)