Mingyuan Zhang, Huirong Li, Zhongang Cai, Jiawei Ren, Lei Yang, Ziwei Liu
Text-driven motion generation has achieved substantial progress with the emergence of diffusion models. However, existing methods still struggle to generate complex motion sequences that correspond to fine-grained descriptions depicting detailed and accurate spatio-temporal actions. This lack of fine controllability limits the adoption of motion generation by a wider audience. To tackle these challenges, we present FineMoGen, a diffusion-based motion generation and editing framework that can synthesize fine-grained motions whose spatio-temporal composition follows the user's instructions. Specifically, FineMoGen builds upon a diffusion model with a novel transformer architecture dubbed Spatio-Temporal Mixture Attention (SAMI). SAMI optimizes the generation of the global attention template from two perspectives: 1) explicitly modeling the constraints of spatio-temporal composition; and 2) utilizing sparsely-activated mixture-of-experts to adaptively extract fine-grained features. To facilitate a large-scale study on this new fine-grained motion generation task, we contribute the HuMMan-MoGen dataset, which consists of 2,968 videos and 102,336 fine-grained spatio-temporal descriptions. Extensive experiments validate that FineMoGen exhibits superior motion generation quality over state-of-the-art methods. Notably, FineMoGen further enables zero-shot motion editing with the aid of modern large language models (LLMs), faithfully manipulating motion sequences according to fine-grained instructions. Project Page: https://mingyuan-zhang.github.io/projects/FineMoGen.html
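The abstract mentions sparsely-activated mixture-of-experts without giving details. As a generic illustration of that idea only, and not FineMoGen's or SAMI's actual implementation, the sketch below routes one feature vector to the top-k experts chosen by a linear gate. The function name `top_k_moe`, the gate weights, and the toy experts are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_moe(x, gate_weights, experts, k=2):
    """Route feature vector x to the k highest-scoring experts.

    Generic top-k gating sketch (hypothetical helper, not the paper's code).
    Returns the mixed output and the indices of the selected experts.
    """
    # One gating logit per expert: dot product of x with that expert's gate row.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Sparse activation: keep only the k experts with the largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalise the gate over the selected experts only.
    probs = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for p, i in zip(probs, top):
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + p * yi for o, yi in zip(out, y)]
    return out, top


# Toy experts and gate weights, chosen purely for illustration.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
output, chosen = top_k_moe([3.0, 1.0], gate_weights, experts, k=2)
```

In this toy setup the third expert receives the lowest gating logit and is never evaluated, which is the point of sparse activation: capacity grows with the number of experts while the per-input cost stays tied to k.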
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Pose Tracking | HumanML3D | Diversity | 9.263 | FineMoGen |
| Pose Tracking | HumanML3D | FID | 0.151 | FineMoGen |
| Pose Tracking | HumanML3D | Multimodality | 2.696 | FineMoGen |
| Pose Tracking | HumanML3D | R Precision Top3 | 0.784 | FineMoGen |
| Pose Tracking | KIT Motion-Language | Diversity | 10.85 | FineMoGen |
| Pose Tracking | KIT Motion-Language | FID | 0.178 | FineMoGen |
| Pose Tracking | KIT Motion-Language | Multimodality | 1.877 | FineMoGen |
| Pose Tracking | KIT Motion-Language | R Precision Top3 | 0.772 | FineMoGen |
| Motion Synthesis | HumanML3D | Diversity | 9.263 | FineMoGen |
| Motion Synthesis | HumanML3D | FID | 0.151 | FineMoGen |
| Motion Synthesis | HumanML3D | Multimodality | 2.696 | FineMoGen |
| Motion Synthesis | HumanML3D | R Precision Top3 | 0.784 | FineMoGen |
| Motion Synthesis | KIT Motion-Language | Diversity | 10.85 | FineMoGen |
| Motion Synthesis | KIT Motion-Language | FID | 0.178 | FineMoGen |
| Motion Synthesis | KIT Motion-Language | Multimodality | 1.877 | FineMoGen |
| Motion Synthesis | KIT Motion-Language | R Precision Top3 | 0.772 | FineMoGen |
| 10-shot image generation | HumanML3D | Diversity | 9.263 | FineMoGen |
| 10-shot image generation | HumanML3D | FID | 0.151 | FineMoGen |
| 10-shot image generation | HumanML3D | Multimodality | 2.696 | FineMoGen |
| 10-shot image generation | HumanML3D | R Precision Top3 | 0.784 | FineMoGen |
| 10-shot image generation | KIT Motion-Language | Diversity | 10.85 | FineMoGen |
| 10-shot image generation | KIT Motion-Language | FID | 0.178 | FineMoGen |
| 10-shot image generation | KIT Motion-Language | Multimodality | 1.877 | FineMoGen |
| 10-shot image generation | KIT Motion-Language | R Precision Top3 | 0.772 | FineMoGen |
| 3D Human Pose Tracking | HumanML3D | Diversity | 9.263 | FineMoGen |
| 3D Human Pose Tracking | HumanML3D | FID | 0.151 | FineMoGen |
| 3D Human Pose Tracking | HumanML3D | Multimodality | 2.696 | FineMoGen |
| 3D Human Pose Tracking | HumanML3D | R Precision Top3 | 0.784 | FineMoGen |
| 3D Human Pose Tracking | KIT Motion-Language | Diversity | 10.85 | FineMoGen |
| 3D Human Pose Tracking | KIT Motion-Language | FID | 0.178 | FineMoGen |
| 3D Human Pose Tracking | KIT Motion-Language | Multimodality | 1.877 | FineMoGen |
| 3D Human Pose Tracking | KIT Motion-Language | R Precision Top3 | 0.772 | FineMoGen |
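The "R Precision Top3" rows report a retrieval-style score. As a rough sketch of how such a top-k score is commonly computed in text-to-motion evaluation, and not the benchmarks' official evaluation code, the function below assumes each text and each motion has already been embedded into a shared space by some pretrained feature extractor, and that the text at index i matches the motion at index i. The name `r_precision_top_k` and the toy embeddings are hypothetical; real evaluations typically draw a fixed-size candidate pool per text.

```python
import math


def r_precision_top_k(text_embs, motion_embs, k=3):
    """Fraction of texts whose matching motion (same index) ranks in the top k.

    Ranking uses Euclidean distance in the assumed shared embedding space.
    """
    hits = 0
    for i, t in enumerate(text_embs):
        dists = [math.dist(t, m) for m in motion_embs]
        # Candidate indices sorted from nearest to farthest.
        ranked = sorted(range(len(dists)), key=lambda j: dists[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy 2-D embeddings (illustrative only): the last pair is badly aligned.
text_embs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
motion_embs = [[0.0, 0.1], [1.0, 1.2], [5.1, 5.0], [0.2, 0.2]]

top1 = r_precision_top_k(text_embs, motion_embs, k=1)
top3 = r_precision_top_k(text_embs, motion_embs, k=3)
```

Here the badly aligned last pair misses at k=1 but is recovered at k=3, which shows why Top-3 scores are never lower than Top-1 scores on the same pool.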