Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, Ziwei Liu
3D human motion generation is crucial for the creative industry. Recent advances rely on generative models with domain knowledge for text-driven motion generation, leading to substantial progress in capturing common motions. However, performance on more diverse motions remains unsatisfactory. In this work, we propose ReMoDiffuse, a diffusion-model-based motion generation framework that integrates a retrieval mechanism to refine the denoising process. ReMoDiffuse enhances the generalizability and diversity of text-driven motion generation with three key designs: 1) Hybrid Retrieval finds appropriate references from the database in terms of both semantic and kinematic similarity. 2) Semantic-Modulated Transformer selectively absorbs retrieval knowledge, adapting to the difference between the retrieved samples and the target motion sequence. 3) Condition Mixture makes better use of the retrieval database during inference, overcoming the scale sensitivity of classifier-free guidance. Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art methods by balancing text-motion consistency and motion quality, especially for more diverse motion generation.
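To make the Hybrid Retrieval idea concrete, the following is a minimal sketch of ranking database entries by a combined semantic and kinematic score. The abstract does not give the scoring formula, so everything here is an assumption for illustration: semantic similarity is taken as cosine similarity between text embeddings, kinematic similarity is approximated by how closely the stored motion length matches the requested length, and the names `hybrid_score`, `retrieve`, `lam`, and the dictionary keys are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_score(query_emb, query_len, entry_emb, entry_len, lam=1.0):
    """Assumed hybrid score: semantic term damped by a length-mismatch penalty.

    `lam` (hypothetical) controls how strongly length mismatch is penalized.
    """
    semantic = cosine(query_emb, entry_emb)
    # Relative length gap in [0, 1); 0 means the lengths match exactly.
    rel_gap = abs(entry_len - query_len) / max(entry_len, query_len)
    return semantic * math.exp(-lam * rel_gap)


def retrieve(query_emb, query_len, database, k=2, lam=1.0):
    """Return the names of the top-k database entries by hybrid score."""
    scored = [
        (hybrid_score(query_emb, query_len, e["emb"], e["length"], lam), e["name"])
        for e in database
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy database: embeddings and frame counts are made up for illustration.
database = [
    {"name": "walk_short", "emb": [1.0, 0.0], "length": 60},
    {"name": "walk_long", "emb": [1.0, 0.0], "length": 120},
    {"name": "jump_short", "emb": [0.0, 1.0], "length": 60},
]
top = retrieve([1.0, 0.0], 60, database, k=2)
```

With this toy data, the entry that matches the query in both text embedding and length ranks first, the same text with a mismatched length ranks second, and the semantically unrelated entry is dropped. A multiplicative penalty is only one possible design; it keeps semantically irrelevant entries at a low score regardless of how well their length matches.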
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Motion Synthesis | HumanML3D | Diversity | 9.018 | ReMoDiffuse |
| Motion Synthesis | HumanML3D | FID | 0.103 | ReMoDiffuse |
| Motion Synthesis | HumanML3D | Multimodality | 1.795 | ReMoDiffuse |
| Motion Synthesis | HumanML3D | R Precision Top3 | 0.795 | ReMoDiffuse |
| Motion Synthesis | KIT Motion-Language | Diversity | 10.8 | ReMoDiffuse |
| Motion Synthesis | KIT Motion-Language | FID | 0.155 | ReMoDiffuse |
| Motion Synthesis | KIT Motion-Language | Multimodality | 1.239 | ReMoDiffuse |
| Motion Synthesis | KIT Motion-Language | R Precision Top3 | 0.765 | ReMoDiffuse |