Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation
Jan Ackermann, Kiyohiro Nakayama, Guandao Yang, Tong Wu, Gordon Wetzstein
2025-06-05
Abstract
Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG's zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.
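To make the text+image-to-garment interface described above concrete, below is a minimal, hypothetical sketch. None of the names (VLGModel, GarmentSpec, generate) come from the paper; they are placeholder stubs illustrating the idea of a multimodal foundation model conditioned on a prompt and an optional reference image, decoding a structured garment representation, not the authors' actual implementation.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GarmentSpec:
    """Placeholder output: a garment described by named panels and seams."""
    panels: List[str] = field(default_factory=list)
    seams: List[Tuple[str, str]] = field(default_factory=list)


class VLGModel:
    """Stub standing in for a vision-language-garment foundation model."""

    def generate(self, prompt: str, image: Optional[bytes] = None) -> GarmentSpec:
        # A real model would fuse text and image tokens and decode a
        # structured garment representation; this stub returns a fixed spec.
        panels = ["front", "back", "sleeve_left", "sleeve_right"]
        seams = [("front", "back"),
                 ("front", "sleeve_left"),
                 ("front", "sleeve_right")]
        return GarmentSpec(panels=panels, seams=seams)


if __name__ == "__main__":
    model = VLGModel()
    # Zero-shot usage: a free-form prompt with no garment-specific fine-tuning.
    spec = model.generate("a fitted denim jacket with wide lapels")
    print(spec.panels)

The design choice worth noting is the structured output: rather than pixels, a garment model of this kind typically emits a symbolic representation (panels, seams, pattern parameters) that downstream tools can simulate or sew, which is what makes zero-shot transfer to unseen styles measurable.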