DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li

2023-07-04 | NeurIPS 2023 11 | Denoising, Philosophy, Point Cloud Generation, 3D Shape Generation

Abstract

Recent Diffusion Transformers (e.g., DiT) have proven highly effective at generating high-quality 2D images. However, it remains unclear whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which runs the denoising process directly on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces much higher quality generations. Specifically, DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. To reduce the computational cost of self-attention in 3D shape generation, we incorporate 3D window attention into the Transformer blocks, since the additional voxel dimension increases the 3D token length and, with it, the cost of attention. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, our Transformer architecture supports efficient fine-tuning from 2D to 3D, where the pre-trained DiT-2D checkpoint on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy of the state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance.
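The abstract's pipeline (voxelize the point cloud, cut the voxel grid into 3D patches, restrict attention to local 3D windows) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names, the `[-1, 1]` coordinate range, the non-overlapping window layout, and the example sizes (a 32^3 grid, patch size 4, window size 4) are all assumptions chosen for illustration.

```python
def voxelize(points, resolution):
    """Map points in [-1, 1]^3 to the set of occupied voxel indices.

    Assumption: coordinates are already normalized to [-1, 1].
    """
    occupied = set()
    for point in points:
        idx = []
        for c in point:
            # Scale [-1, 1] to [0, resolution) and clamp the upper edge.
            i = int((c + 1.0) / 2.0 * resolution)
            idx.append(min(max(i, 0), resolution - 1))
        occupied.add(tuple(idx))
    return occupied


def patch_tokens(resolution, patch):
    """Number of non-overlapping patch^3 tokens in a resolution^3 grid."""
    assert resolution % patch == 0, "patch size must divide the resolution"
    return (resolution // patch) ** 3


def attention_pairs(tokens_per_axis, window=None):
    """Count query-key pairs for global or windowed self-attention.

    With window=None every token attends to every token. Otherwise the
    token grid is split into non-overlapping window^3 blocks and attention
    is computed only inside each block (hypothetical layout).
    """
    n = tokens_per_axis ** 3
    if window is None:
        return n * n
    assert tokens_per_axis % window == 0, "window must divide the token grid"
    num_windows = (tokens_per_axis // window) ** 3
    return num_windows * (window ** 3) ** 2


if __name__ == "__main__":
    cloud = [(-1.0, -1.0, -1.0), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    print(sorted(voxelize(cloud, resolution=4)))
    print(patch_tokens(resolution=32, patch=4))
    print(attention_pairs(8), attention_pairs(8, window=4))
```

Under these assumed sizes, a 32^3 grid with patch size 4 yields 8 tokens per axis. Global attention then scores every token against every other token, while windowed attention scores each token only against the tokens in its own block, which is the saving the abstract attributes to 3D window attention.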

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Point Cloud Generation | ShapeNet Car | 1-NNA-CD | 51.04 | DiT-3D |
| Point Cloud Generation | ShapeNet Car | CD | 56.15 | DiT-3D |
| Point Cloud Generation | ShapeNet Car | EMD | 50.86 | DiT-3D |
| Point Cloud Generation | ShapeNet Airplane | 1-NNA-CD | 62.35 | DiT-3D |
| Point Cloud Generation | ShapeNet Airplane | CD | 53.16 | DiT-3D |
| Point Cloud Generation | ShapeNet Airplane | EMD | 54.39 | DiT-3D |
| Point Cloud Generation | ShapeNet Chair | 1-NNA-CD | 51.99 | DiT-3D |
| Point Cloud Generation | ShapeNet Chair | CD | 54.76 | DiT-3D |
| Point Cloud Generation | ShapeNet Chair | EMD | 57.37 | DiT-3D |
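For readers unfamiliar with the metrics named in the abstract, here is a minimal sketch of Chamfer Distance (CD), 1-Nearest Neighbor Accuracy (1-NNA), and Coverage. It is not the evaluation code behind the numbers above. The CD convention used (mean squared nearest-neighbor distance, summed over both directions) is an assumption, since conventions differ between codebases, and all function names are hypothetical. The brute-force loops are only practical for tiny inputs.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point sets (lists of 3-tuples).

    Assumed convention: mean squared nearest-neighbor distance, summed
    over both directions.
    """
    def sq(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    a_to_b = sum(min(sq(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


def one_nna(generated, reference, dist=chamfer_distance):
    """Leave-one-out 1-NN classification accuracy over the pooled shapes.

    Each shape is labeled generated (0) or reference (1) and classified by
    the label of its nearest other shape. Accuracy near 0.5 means the two
    sets are hard to tell apart.
    """
    shapes = [(s, 0) for s in generated] + [(s, 1) for s in reference]
    correct = 0
    for i, (shape, label) in enumerate(shapes):
        nearest = min(
            (j for j in range(len(shapes)) if j != i),
            key=lambda j: dist(shape, shapes[j][0]),
        )
        correct += shapes[nearest][1] == label
    return correct / len(shapes)


def coverage(generated, reference, dist=chamfer_distance):
    """Fraction of reference shapes that are the nearest neighbor of at
    least one generated shape."""
    matched = {
        min(range(len(reference)), key=lambda j: dist(g, reference[j]))
        for g in generated
    }
    return len(matched) / len(reference)


if __name__ == "__main__":
    gen = [[(0.0, 0.0, 0.0)], [(5.0, 0.0, 0.0)]]
    ref = [[(0.1, 0.0, 0.0)], [(5.1, 0.0, 0.0)]]
    print(one_nna(gen, ref), coverage(gen, ref))
```

Passing a different `dist` function, such as an Earth Mover's Distance implementation, would give the EMD-based variants of the same metrics.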

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
HUG-VAS: A Hierarchical NURBS-Based Generative Model for Aortic Geometry Synthesis and Controllable Editing (2025-07-15)
AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air (2025-07-15)
Streaming 4D Visual Geometry Transformer (2025-07-15)
A statistical physics framework for optimal learning (2025-07-10)
LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models (2025-07-08)