Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Text-image Alignment for Diffusion-based Perception

Neehar Kondapaneni, Markus Marks, Manuel Knott, Rogerio Guimaraes, Pietro Perona

Date: 2023-09-29
Venue: CVPR 2024
Tasks: Weakly Supervised Object Detection, Semantic Segmentation, Depth Estimation, Image Generation, Object Detection, Monocular Depth Estimation

Abstract

Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However, the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically, it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps, leading to better perceptual performance. Our approach improves upon the current state-of-the-art (SOTA) in diffusion-based semantic segmentation on ADE20K and the current overall SOTA for depth estimation on NYUv2. Furthermore, our method generalizes to the cross-domain setting. We use model personalization and caption modifications to align our model to the target domain and find improvements over unaligned baselines. Our cross-domain object detection model, trained on Pascal VOC, achieves SOTA results on Watercolor2K. Our cross-domain segmentation method, trained on Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving. Project page: https://www.vision.caltech.edu/tadp/. Code: https://github.com/damaggu/TADP.

Results

Task                  | Dataset           | Metric                  | Value | Model
Depth Estimation      | NYU-Depth V2      | Delta < 1.25            | 0.976 | TADP
Depth Estimation      | NYU-Depth V2      | Delta < 1.25^2          | 0.997 | TADP
Depth Estimation      | NYU-Depth V2      | Delta < 1.25^3          | 0.999 | TADP
Depth Estimation      | NYU-Depth V2      | RMSE                    | 0.225 | TADP
Depth Estimation      | NYU-Depth V2      | Absolute relative error | 0.062 | TADP
Depth Estimation      | NYU-Depth V2      | log10                   | 0.027 | TADP
Semantic Segmentation | Nighttime Driving | mIoU                    | 60.8  | TADP
Semantic Segmentation | ADE20K            | Validation mIoU         | 55.9  | TADP
Object Detection      | Comic2k           | MAP                     | 57.4  | TADP
Object Detection      | Watercolor2k      | MAP                     | 72.2  | TADP
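
The page lists the NYU-Depth V2 metrics without defining them. The sketch below shows how they are conventionally computed in the monocular depth estimation literature, assuming per-pixel predicted and ground-truth depths that are strictly positive. The function name `depth_metrics` and the flat-list inputs are illustrative assumptions and are not taken from the TADP code; the official evaluation may also apply masking or cropping that is not shown here.

```python
import math


def depth_metrics(pred, gt):
    """Conventional depth metrics over paired, strictly positive depth values.

    `pred` and `gt` are flat sequences of per-pixel depths (hypothetical
    inputs for illustration; real pipelines usually mask invalid pixels first).
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    n = len(gt)
    pairs = list(zip(pred, gt))

    # Threshold accuracy: share of pixels with max(p/g, g/p) < 1.25**k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    delta = {k: sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)}

    # Root mean squared error in the depth unit of the inputs.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)

    # Absolute relative error: mean of |p - g| / g.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n

    # Mean absolute difference of base-10 logarithms.
    log10 = sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n

    return {
        "delta1": delta[1],
        "delta2": delta[2],
        "delta3": delta[3],
        "rmse": rmse,
        "abs_rel": abs_rel,
        "log10": log10,
    }
```

Under these definitions, higher is better for the three Delta thresholds, while lower is better for RMSE, absolute relative error, and log10.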
