


Text-guided 3D Human Generation from 2D Collections

Tsu-Jui Fu, Wenhan Xiong, Yixin Nie, Jingyu Liu, Barlas Oğuz, William Yang Wang

2023-05-23 | Tasks: Text-to-3D-Human Generation (text-to-3d-human), 3D geometry

Abstract

3D human modeling has been widely used for engaging interaction in gaming, film, and animation. The customization of these characters is crucial for creativity and scalability, which highlights the importance of controllability. In this work, we introduce Text-guided 3D Human Generation (T3H), where a model generates a 3D human guided by a fashion description. There are two goals: 1) the 3D human should render articulately, and 2) its outfit should be controlled by the given text. To address this T3H task, we propose Compositional Cross-modal Human (CCH). CCH adopts cross-modal attention to fuse compositional human rendering with the extracted fashion semantics, so that each human body part perceives the relevant textual guidance as its visual pattern. We incorporate a human prior and semantic discrimination to enhance 3D geometry transformation and fine-grained consistency, enabling CCH to learn from 2D collections for data efficiency. We conduct evaluations on DeepFashion and SHHQ with diverse fashion attributes covering the shape, fabric, and color of upper and lower clothing. Extensive experiments demonstrate that CCH achieves superior results for T3H with high efficiency.
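The abstract's central mechanism, per-body-part visual features attending to fashion-text tokens via cross-modal attention, can be sketched as follows. This is an illustrative sketch only, not the authors' released implementation: the module name, feature dimensions, part count, and the `part_feats`/`text_feats` inputs are all hypothetical.

```python
import torch
import torch.nn as nn

class CrossModalPartAttention(nn.Module):
    """Sketch of cross-modal fusion: each body-part feature queries the
    fashion-text tokens, so that e.g. an 'upper clothing' part can attend
    to phrases describing its shape, fabric, and color."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        # Queries come from body parts; keys/values come from the text.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, part_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # part_feats: (B, n_parts, dim)  visual features, one per body part
        # text_feats: (B, n_tokens, dim) encoded fashion description
        fused, _ = self.attn(query=part_feats, key=text_feats, value=text_feats)
        return self.norm(part_feats + fused)  # residual fusion per part

# Hypothetical shapes: 24 SMPL-style body parts, 77 text tokens.
parts = torch.randn(2, 24, 256)
text = torch.randn(2, 77, 256)
fused = CrossModalPartAttention()(parts, text)  # -> (2, 24, 256)
```

The residual connection keeps each part's geometry features intact while the attention output injects the text-conditioned appearance signal.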

Results

Task                        | Dataset     | Metric                          | Value  | Model
Text-to-3D-Human Generation | DeepFashion | CLIP Score                      | 25.031 | CCH
Text-to-3D-Human Generation | DeepFashion | Depth Error                     | 1.21   | CCH
Text-to-3D-Human Generation | DeepFashion | Fashion Accuracy                | 72.038 | CCH
Text-to-3D-Human Generation | DeepFashion | Fréchet Inception Distance      | 22.175 | CCH
Text-to-3D-Human Generation | DeepFashion | Percentage of Correct Keypoints | 88.313 | CCH
Text-to-3D-Human Generation | SHHQ        | CLIP Score                      | 27.873 | CCH
Text-to-3D-Human Generation | SHHQ        | Depth Error                     | 1.67   | CCH
Text-to-3D-Human Generation | SHHQ        | Fashion Accuracy                | 76.194 | CCH
Text-to-3D-Human Generation | SHHQ        | Fréchet Inception Distance      | 33.348 | CCH
Text-to-3D-Human Generation | SHHQ        | Percentage of Correct Keypoints | 87.879 | CCH
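Among these metrics, CLIP Score measures how well a rendered human matches its fashion description. The reported values (~25-28) appear consistent with the common 100x cosine-similarity convention; below is a minimal sketch of that computation using the Hugging Face `transformers` CLIP API. The model choice (ViT-B/32) and scaling are assumptions, and the paper's exact evaluation setup may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: a standard ViT-B/32 CLIP checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """100x cosine similarity between image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (100.0 * (img * txt).sum(dim=-1)).item()

# e.g. clip_score(render, "a woman in a short-sleeve floral cotton top")
```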

Related Papers

Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling (2025-07-15)
TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update (2025-07-15)
Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion (2025-07-08)
DreamGrasp: Zero-Shot 3D Multi-Object Reconstruction from Partial-View Images for Robotic Manipulation (2025-07-08)
RoboScape: Physics-informed Embodied World Model (2025-06-29)
DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (2025-06-26)
PanSt3R: Multi-view Consistent Panoptic Segmentation (2025-06-26)
Dense 3D Displacement Estimation for Landslide Monitoring via Fusion of TLS Point Clouds and Embedded RGB Images (2025-06-19)