Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation
Jan Ackermann, Kiyohiro Nakayama, Guandao Yang, Tong Wu, Gordon Wetzstein
2025-06-05
Abstract
Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG's zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.
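To make the text+image-to-garment interface described above concrete, below is a minimal, hypothetical sketch. None of the names (VLGModel, GarmentSpec, generate) come from the paper; they are placeholder stubs illustrating the idea of a multimodal foundation model conditioned on a prompt and an optional reference image, decoding a structured garment representation, not the authors' actual implementation.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GarmentSpec:
    """Placeholder output: a garment described by named panels and seams."""
    panels: List[str] = field(default_factory=list)
    seams: List[Tuple[str, str]] = field(default_factory=list)


class VLGModel:
    """Stub standing in for a vision-language-garment foundation model."""

    def generate(self, prompt: str, image: Optional[bytes] = None) -> GarmentSpec:
        # A real model would fuse text and image tokens and decode a
        # structured garment representation; this stub returns a fixed spec.
        panels = ["front", "back", "sleeve_left", "sleeve_right"]
        seams = [("front", "back"),
                 ("front", "sleeve_left"),
                 ("front", "sleeve_right")]
        return GarmentSpec(panels=panels, seams=seams)


if __name__ == "__main__":
    model = VLGModel()
    # Zero-shot usage: a free-form prompt with no garment-specific fine-tuning.
    spec = model.generate("a fitted denim jacket with wide lapels")
    print(spec.panels)

The design choice worth noting is the structured output: rather than pixels, a garment model of this kind typically emits a symbolic representation (panels, seams, pattern parameters) that downstream tools can simulate or sew, which is what makes zero-shot transfer to unseen styles measurable.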