Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model
Yin Wang, Zhiying Leng, Frederick W. B. Li, Shun-Cheng Wu, Xiaohui Liang
Text-driven human motion generation in computer vision is both significant and challenging. However, current methods are limited to producing either deterministic or imprecise motion sequences, and they fail to effectively control the temporal and spatial relationships required to conform to a given text description. In this work, we propose a fine-grained method for generating high-quality, conditional human motion sequences that supports precise text descriptions. Our approach consists of two key components: 1) a linguistics-structure-assisted module that constructs accurate and complete language features to fully exploit the text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistic features from shallow and deep graph neural networks to perform multi-step inference. Experiments show that our approach outperforms existing text-driven motion generation methods on the HumanML3D and KIT test sets and generates motions that conform more visibly to the text conditions.
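The shallow-vs-deep GNN idea in component 2) can be illustrated with a toy graph convolution: stacking layers widens each node's receptive field, so early-layer outputs capture neighborhood structure while later-layer outputs aggregate overall graph semantics. This is only a minimal sketch of that intuition (plain NumPy, symmetric-normalized GCN propagation), not the paper's actual architecture or feature dimensions:

```python
import numpy as np

def normalize_adjacency(A):
    # Symmetrically normalize with self-loops: D^{-1/2} (A + I) D^{-1/2}
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, H, W):
    # One graph-convolution step: aggregate neighbor features, project, ReLU
    return np.maximum(A_norm @ H @ W, 0.0)

def gcn_features(A, X, weights):
    # Stack layers and keep every intermediate output, so "shallow" (local,
    # neighborhood-level) and "deep" (global, sentence-level) features can
    # both be read off the same network.
    A_norm = normalize_adjacency(A)
    H, per_layer = X, []
    for W in weights:
        H = gcn_layer(A_norm, H, W)
        per_layer.append(H)
    return per_layer

# Toy example: a 4-word sentence graph (chain of syntactic links),
# random word features, three stacked layers.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                      # 4 nodes, 8-dim features
weights = [rng.standard_normal((8, 8)) for _ in range(3)]
feats = gcn_features(A, X, weights)                  # feats[0]: local, feats[-1]: global
```

After one layer each node has only mixed with its direct neighbors; after three layers every node's feature depends on the whole chain, which is the sense in which deep outputs carry "overall" semantics.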
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Motion Synthesis | HumanML3D | Diversity | 9.278 | Fg-T2M |
| Motion Synthesis | HumanML3D | FID | 0.243 | Fg-T2M |
| Motion Synthesis | HumanML3D | MultiModality | 1.614 | Fg-T2M |
| Motion Synthesis | HumanML3D | R-Precision (Top-3) | 0.783 | Fg-T2M |
| Motion Synthesis | KIT Motion-Language | Diversity | 10.93 | Fg-T2M |
| Motion Synthesis | KIT Motion-Language | FID | 0.571 | Fg-T2M |
| Motion Synthesis | KIT Motion-Language | MultiModality | 1.019 | Fg-T2M |
| Motion Synthesis | KIT Motion-Language | R-Precision (Top-3) | 0.745 | Fg-T2M |