Wenyang Zhou, Zhiyang Dou, Zeyu Cao, Zhouyingcheng Liao, Jingbo Wang, Wenjia Wang, YuAn Liu, Taku Komura, Wenping Wang, Lingjie Liu
We introduce the Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation. Current state-of-the-art generative diffusion models produce impressive results but struggle to achieve fast generation without sacrificing quality. On the one hand, previous works such as motion latent diffusion conduct diffusion within a latent space for efficiency, but learning such a latent space can be a non-trivial effort. On the other hand, accelerating generation by naively increasing the sampling step size, as in DDIM, often degrades quality because it fails to approximate the complex denoising distribution. To address these issues, we propose EMDM, which captures the complex distribution across multiple sampling steps in the diffusion model, allowing for far fewer sampling steps and significant acceleration in generation. This is achieved with a conditional denoising diffusion GAN that captures multimodal data distributions over arbitrary (and potentially larger) step sizes conditioned on control signals, enabling few-step motion sampling with high fidelity and diversity. To minimize undesired motion artifacts, geometric losses are imposed during network learning. As a result, EMDM achieves real-time motion generation and significantly improves the efficiency of motion diffusion models compared to existing methods while maintaining high-quality motion generation. Our code will be publicly available upon publication.
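The few-step sampling idea described in the abstract can be illustrated with a minimal sketch in the style of a denoising diffusion GAN. This is not the authors' implementation: the function names (`make_schedule`, `posterior_sample`, `sample_motion`, `toy_generator`), the linear beta schedule, and its values are all assumptions chosen for illustration. The sketch assumes a conditional generator that, given the noisy motion `x_t`, the step index, a control signal, and a random latent `z`, predicts a clean motion; the next sample is then drawn from the standard Gaussian diffusion posterior q(x_{t-1} | x_t, x_0).

```python
import numpy as np

def make_schedule(num_steps, beta_min=0.1, beta_max=0.9):
    # Illustrative linear schedule; large betas correspond to large step sizes.
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def posterior_sample(x0_pred, x_t, t, betas, alphas, alpha_bars, rng):
    # Sample x_{t-1} from the Gaussian posterior q(x_{t-1} | x_t, x_0).
    abar_t = alpha_bars[t]
    abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    coef_x0 = np.sqrt(abar_prev) * betas[t] / (1.0 - abar_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = coef_x0 * x0_pred + coef_xt * x_t
    if t == 0:
        # At the last step the posterior collapses onto the predicted x_0.
        return mean
    var = betas[t] * (1.0 - abar_prev) / (1.0 - abar_t)
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def sample_motion(generator, cond, shape, num_steps=4, latent_dim=8, seed=0):
    # Few-step reverse process: start from noise, denoise in num_steps jumps.
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x_t = rng.standard_normal(shape)
    for t in reversed(range(num_steps)):
        z = rng.standard_normal(latent_dim)   # latent that makes the output multimodal
        x0_pred = generator(x_t, t, cond, z)  # hypothetical conditional generator
        x_t = posterior_sample(x0_pred, x_t, t, betas, alphas, alpha_bars, rng)
    return x_t

def toy_generator(x_t, t, cond, z):
    # Stand-in for a trained network: ignores its inputs except the condition.
    return np.broadcast_to(cond, x_t.shape).copy()

# Example: a "motion" of 60 frames with 3 features, conditioned on a constant vector.
motion = sample_motion(toy_generator, cond=np.array([0.1, 0.2, 0.3]), shape=(60, 3))
```

In a trained model the generator would be a network optimized adversarially against a discriminator; the toy generator here only shows how the sampling loop is wired together.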
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Motion Synthesis | HumanML3D | Diversity | 9.551 | EMDM |
| Motion Synthesis | HumanML3D | FID | 0.112 | EMDM |
| Motion Synthesis | HumanML3D | Multimodality | 1.641 | EMDM |
| Motion Synthesis | HumanML3D | R Precision Top3 | 0.786 | EMDM |
| Motion Synthesis | KIT Motion-Language | Diversity | 10.96 | EMDM |
| Motion Synthesis | KIT Motion-Language | FID | 0.261 | EMDM |
| Motion Synthesis | KIT Motion-Language | Multimodality | 1.343 | EMDM |
| Motion Synthesis | KIT Motion-Language | R Precision Top3 | 0.78 | EMDM |