Chongyang Zhong, Lei Hu, Zihao Zhang, Shihong Xia
Generating 3D human motion from textual descriptions has been a research focus in recent years. It requires the generated motion to be diverse, natural, and consistent with the textual description. Due to the complex spatio-temporal nature of human motion and the difficulty of learning the cross-modal relationship between text and motion, text-driven motion generation remains a challenging problem. To address these issues, we propose **AttT2M**, a two-stage method with a multi-perspective attention mechanism: **body-part attention** and **global-local motion-text attention**. The former addresses motion embedding: it introduces a body-part spatio-temporal encoder into the VQ-VAE to learn a more expressive discrete latent space. The latter addresses cross-modal learning: it captures the sentence-level and word-level relationships between motion and text. The text-driven motion is finally generated with a generative transformer. Extensive experiments on HumanML3D and KIT-ML demonstrate that our method outperforms current state-of-the-art works in both qualitative and quantitative evaluation, and achieves fine-grained synthesis and action-to-motion generation. Our code is available at https://github.com/ZcyMonkey/AttT2M
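The first stage described above tokenizes continuous motion into a discrete latent space via a VQ-VAE. The quantization step at its core can be sketched as follows; this is a minimal illustration, assuming per-frame latents of dimension `D` and a codebook of `K` entries (names and sizes are hypothetical, not taken from the AttT2M implementation):

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    latents:  (T, D) per-frame encoder outputs
    codebook: (K, D) learned discrete embeddings
    returns:  token indices of shape (T,) and quantized latents (T, D)
    """
    # Squared Euclidean distance between every latent and every code,
    # via broadcasting: (T, 1, D) - (1, K, D) -> (T, K, D) -> (T, K).
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d.argmin(axis=1)          # discrete motion tokens
    return tokens, codebook[tokens]    # quantized latents fed to the decoder

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))  # K=512 codes, D=64 dimensions
latents = rng.normal(size=(16, 64))    # 16 motion frames
tokens, quantized = quantize(latents, codebook)
```

The resulting token sequence is what the second-stage generative transformer models autoregressively, conditioned on the text embedding.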
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Motion Synthesis | HumanML3D | Diversity | 9.7 | AttT2M |
| Motion Synthesis | HumanML3D | FID | 0.112 | AttT2M |
| Motion Synthesis | HumanML3D | Multimodality | 2.452 | AttT2M |
| Motion Synthesis | HumanML3D | R Precision Top3 | 0.786 | AttT2M |
| Motion Synthesis | KIT Motion-Language | Diversity | 10.96 | AttT2M |
| Motion Synthesis | KIT Motion-Language | FID | 0.87 | AttT2M |
| Motion Synthesis | KIT Motion-Language | Multimodality | 2.281 | AttT2M |
| Motion Synthesis | KIT Motion-Language | R Precision Top3 | 0.751 | AttT2M |