StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Jeongho Kim, Gyojung Gu, Minho Park, Sunghyun Park, Jaegul Choo

2023-12-04 | CVPR 2024 | Virtual Try-on, Semantic correspondence

Abstract

Given a clothing image and a person image, image-based virtual try-on aims to generate a customized image that appears natural and accurately reflects the characteristics of the clothing image. In this work, we aim to expand the applicability of the pre-trained diffusion model so that it can be utilized independently for the virtual try-on task. The main challenge is to preserve the clothing details while effectively utilizing the robust generative capability of the pre-trained model. To tackle this challenge, we propose StableVITON, which learns the semantic correspondence between the clothing and the human body within the latent space of the pre-trained diffusion model in an end-to-end manner. Our proposed zero cross-attention blocks not only preserve the clothing details by learning the semantic correspondence but also generate high-fidelity images by utilizing the inherent knowledge of the pre-trained model in the warping process. Through our proposed attention total variation loss and augmentation, we obtain sharp attention maps, resulting in a more precise representation of clothing details. StableVITON outperforms the baselines in qualitative and quantitative evaluation, showing promising quality on arbitrary person images. Our code is available at https://github.com/rlawjdghek/StableVITON.
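
The abstract names two mechanisms without defining them, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "zero" in zero cross-attention refers to a zero-initialized output projection (so the block initially leaves the pre-trained features unchanged), and that the attention total variation loss penalizes abrupt changes in the attention-weighted center coordinate between neighboring query positions. The function names, the omission of learned query/key/value projections, and the plain-list tensor layout are all hypothetical simplifications.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def zero_cross_attention(person_feats, cloth_feats, out_weight):
    """Residual cross-attention from person tokens (queries) to clothing tokens.

    Assumption: `out_weight` (dim x dim) is the output projection and starts
    at all zeros, so the block is an identity mapping before training.
    Learned query/key/value projections are omitted for brevity.
    Returns (updated person features, attention rows).
    """
    dim = len(cloth_feats[0])
    scale = 1.0 / math.sqrt(dim)
    attn = [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in cloth_feats])
        for q in person_feats
    ]
    out = []
    for q, row in zip(person_feats, attn):
        # Attention-weighted sum of clothing features for this query.
        attended = [sum(w * v[d] for w, v in zip(row, cloth_feats)) for d in range(dim)]
        # Output projection, then residual connection to the person feature.
        projected = [
            sum(attended[i] * out_weight[i][j] for i in range(dim))
            for j in range(len(q))
        ]
        out.append([r + p for r, p in zip(q, projected)])
    return out, attn


def attention_centers(attn, key_w):
    """Expected (x, y) clothing position attended to by each query."""
    centers = []
    for row in attn:
        cx = sum(w * (idx % key_w) for idx, w in enumerate(row))
        cy = sum(w * (idx // key_w) for idx, w in enumerate(row))
        centers.append((cx, cy))
    return centers


def attention_tv_loss(attn, query_h, query_w, key_w):
    """L1 total variation of the center-coordinate map over the query grid.

    Assumption: this is one plausible reading of the loss named in the abstract.
    """
    centers = attention_centers(attn, key_w)
    loss = 0.0
    for y in range(query_h):
        for x in range(query_w):
            cx, cy = centers[y * query_w + x]
            if x + 1 < query_w:  # right neighbor
                nx, ny = centers[y * query_w + x + 1]
                loss += abs(nx - cx) + abs(ny - cy)
            if y + 1 < query_h:  # bottom neighbor
                nx, ny = centers[(y + 1) * query_w + x]
                loss += abs(nx - cx) + abs(ny - cy)
    return loss
```

Under these assumptions, an all-zero `out_weight` returns the person features untouched, which is how such a block could be attached to a pre-trained model without disturbing it at the start of training. For the loss, a spatially coherent attention pattern on a 2x2 grid scores lower than a scrambled one, which is the kind of ordering a smoothness penalty on attention centers is meant to reward.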

Results

Task	Dataset	Metric	Value	Model
Virtual Try-on	VITON-HD	FID	8.233	StableVITON
1 Image, 2*2 Stitchi	VITON-HD	FID	8.233	StableVITON
