Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, Hongxia Yang
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image. Existing methods usually build multi-stage frameworks that handle clothes warping and body blending separately, or rely heavily on intermediate parser-based labels, which may be noisy or even inaccurate. To address these challenges, we propose a single-stage try-on framework built on a novel Deformable Attention Flow (DAFlow), which applies the deformable attention scheme to multi-flow estimation. With only pose keypoints as guidance, self- and cross-deformable attention flows are estimated for the reference person and the garment images, respectively. By sampling multiple flow fields, feature-level and pixel-level information from different semantic areas is simultaneously extracted and merged through the attention mechanism. This enables clothes warping and body synthesis at the same time, leading to photo-realistic results in an end-to-end manner. Extensive experiments on two try-on datasets demonstrate that our proposed method achieves state-of-the-art performance both qualitatively and quantitatively. Furthermore, additional experiments on two other image editing tasks illustrate the versatility of our method for multi-view synthesis and image animation.
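
The idea of sampling multiple flow fields and merging the samples with attention can be illustrated with a minimal sketch. This is not the authors' implementation: in DAFlow the flows and attention scores are predicted by a network guided by pose keypoints, whereas here they are plain inputs. The function names (`bilinear_sample`, `softmax`, `multi_flow_attention_warp`), the single-channel grid, and the border clamping are all assumptions made for illustration.

```python
import math


def bilinear_sample(img, y, x):
    """Sample a 2D grid (list of rows) at a fractional (y, x), clamping to the border."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
    bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_flow_attention_warp(source, flows, logits):
    """Warp `source` with K flow fields and merge the K samples per pixel.

    source: H x W grid of values (hypothetical single-channel feature map).
    flows:  list of K flow fields, each H x W of (dy, dx) offsets.
    logits: H x W grid of K attention scores (assumed given, not learned here).
    """
    h, w = len(source), len(source[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # One sample per flow field, taken at the offset location.
            samples = [
                bilinear_sample(source, i + flow[i][j][0], j + flow[i][j][1])
                for flow in flows
            ]
            # Attention weights decide how much each sample contributes.
            weights = softmax(logits[i][j])
            out[i][j] = sum(wt * s for wt, s in zip(weights, samples))
    return out
```

With equal attention scores, each output pixel is simply the average of the values sampled by the K flows; unequal scores let one flow dominate in a given region.
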
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Virtual Try-on | VITON | FID | 10.97 | SDAFN |
| Virtual Try-on | VITON | IS | 2.859 | SDAFN |
| Virtual Try-on | VITON | PSNR | 26.48 | SDAFN |
| Virtual Try-on | VITON | SSIM | 0.888 | SDAFN |