Qiran Zou, Shangyuan Yuan, Shian Du, Yu Wang, Chang Liu, Yi Xu, Jie Chen, Xiangyang Ji
We study text-to-motion synthesis, a challenging task that aims to generate motions that align with textual descriptions and exhibit coordinated movements. Current part-based methods introduce part partition into the motion synthesis process to achieve finer-grained generation. However, these methods suffer from a lack of coordination between different part motions and from the difficulty networks have in understanding part concepts. Moreover, introducing finer-grained part concepts increases computational complexity. In this paper, we propose Part-Coordinating Text-to-Motion Synthesis (ParCo), which has enhanced capabilities for understanding part motions and for communication among different part motion generators, ensuring coordinated and fine-grained motion synthesis. Specifically, we discretize whole-body motion into multiple part motions to establish the prior concept of different parts. Afterward, we employ multiple lightweight generators designed to synthesize different part motions and coordinate them through our part coordination module. Our approach demonstrates superior performance on common benchmarks, including HumanML3D and KIT-ML, at economical computational cost, providing substantial evidence of its effectiveness. Code is available at https://github.com/qrzou/ParCo.
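As a rough illustration of the part partition idea only, the sketch below splits a whole-body motion array into per-part arrays and reassembles it. It assumes a 22-joint skeleton stored as a `(frames, joints, 3)` array; the joint grouping in `PART_JOINTS` and the helper names `split_into_parts` and `merge_parts` are hypothetical and are not taken from the paper or the ParCo repository. The discretization, the part generators, and the part coordination module are not sketched here.

```python
import numpy as np

# Hypothetical grouping of a 22-joint skeleton into body parts.
# The indices are illustrative only and are not taken from the paper.
PART_JOINTS = {
    "root": [0],
    "backbone": [3, 6, 9, 12, 15],
    "left_leg": [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
    "left_arm": [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
}


def split_into_parts(motion):
    """Split a (frames, joints, 3) array into one array per body part."""
    return {name: motion[:, idx, :] for name, idx in PART_JOINTS.items()}


def merge_parts(parts, num_joints):
    """Reassemble per-part arrays into a whole-body (frames, joints, 3) array."""
    first = next(iter(parts.values()))
    motion = np.zeros((first.shape[0], num_joints, 3), dtype=first.dtype)
    for name, idx in PART_JOINTS.items():
        motion[:, idx, :] = parts[name]
    return motion


# Random stand-in for a short motion clip: 8 frames, 22 joints, xyz positions.
rng = np.random.default_rng(0)
whole_body = rng.standard_normal((8, 22, 3))

parts = split_into_parts(whole_body)
restored = merge_parts(parts, num_joints=22)
```

Because the hypothetical groups cover every joint exactly once, merging the parts recovers the original array, which makes the split easy to sanity-check before any per-part model is attached.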
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Motion Synthesis | HumanML3D | Diversity | 9.576 | ParCo |
| Motion Synthesis | HumanML3D | FID | 0.109 | ParCo |
| Motion Synthesis | HumanML3D | Multimodality | 1.382 | ParCo |
| Motion Synthesis | HumanML3D | R Precision Top3 | 0.801 | ParCo |
| Motion Synthesis | KIT Motion-Language | Diversity | 10.95 | ParCo |
| Motion Synthesis | KIT Motion-Language | FID | 0.453 | ParCo |
| Motion Synthesis | KIT Motion-Language | Multimodality | 1.245 | ParCo |
| Motion Synthesis | KIT Motion-Language | R Precision Top3 | 0.772 | ParCo |