Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen
In this work, we investigate a simple and must-know conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based methods. For example, on HumanML3D, currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), while our FID of 0.116 largely outperforms MotionDiffuse's 0.630. Additionally, we conduct analyses on HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE remains a competitive approach for human motion generation.
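The two training recipes named above can be illustrated concretely. Below is a minimal sketch, not the authors' exact implementation: an EMA update of the VQ-VAE codebook with Code Reset (re-initialising rarely used codes from encoder outputs), plus a corruption function for the GPT stage that replaces ground-truth code indices with random ones at rate τ, matching the τ settings reported in the table. All function names, the `reset_threshold` parameter, and the default `decay` are illustrative assumptions.

```python
import numpy as np

def ema_codebook_update(codebook, counts_ema, sums_ema, z_e, assignments,
                        decay=0.99, reset_threshold=1.0, eps=1e-5):
    """One EMA update step for a VQ-VAE codebook (hedged sketch).

    codebook:    (K, D) current code vectors
    counts_ema:  (K,)   EMA of per-code usage counts
    sums_ema:    (K, D) EMA of summed encoder outputs per code
    z_e:         (N, D) encoder outputs in this batch
    assignments: (N,)   index of the nearest code for each output
    """
    K, D = codebook.shape
    counts = np.bincount(assignments, minlength=K).astype(float)
    # Sum the encoder outputs assigned to each code.
    sums = np.zeros((K, D))
    np.add.at(sums, assignments, z_e)

    counts_ema = decay * counts_ema + (1 - decay) * counts
    sums_ema = decay * sums_ema + (1 - decay) * sums
    codebook = sums_ema / (counts_ema[:, None] + eps)

    # Code Reset: re-initialise rarely used codes from random encoder
    # outputs so the whole codebook stays active.
    dead = counts_ema < reset_threshold
    if dead.any():
        replacements = z_e[np.random.randint(len(z_e), size=dead.sum())]
        codebook[dead] = replacements
        counts_ema[dead] = 1.0
        sums_ema[dead] = replacements
    return codebook, counts_ema, sums_ema

def corrupt_indices(indices, num_codes, tau=0.5, rng=None):
    """Training-time corruption for the GPT stage (hedged sketch): each
    ground-truth code index is replaced by a random code with probability
    tau, so the model sees imperfect prefixes as it will at test time."""
    rng = rng or np.random.default_rng()
    out = np.array(indices)
    mask = rng.random(len(out)) < tau
    out[mask] = rng.integers(num_codes, size=int(mask.sum()))
    return out
```

With `tau=0` the sequence is left untouched (matching the τ = 0 rows of the table); sampling τ from U[0, 1] per sequence corresponds to the τ ∈ U[0, 1] variant.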
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text-to-Motion Generation | HumanML3D | Diversity | 9.761 | T2M-GPT (τ = 0.5) |
| Text-to-Motion Generation | HumanML3D | FID | 0.116 | T2M-GPT (τ = 0.5) |
| Text-to-Motion Generation | HumanML3D | Multimodality | 1.856 | T2M-GPT (τ = 0.5) |
| Text-to-Motion Generation | HumanML3D | R-Precision (Top-3) | 0.775 | T2M-GPT (τ = 0.5) |
| Text-to-Motion Generation | HumanML3D | Diversity | 9.844 | T2M-GPT (τ = 0) |
| Text-to-Motion Generation | HumanML3D | FID | 0.140 | T2M-GPT (τ = 0) |
| Text-to-Motion Generation | HumanML3D | Multimodality | 3.285 | T2M-GPT (τ = 0) |
| Text-to-Motion Generation | HumanML3D | R-Precision (Top-3) | 0.685 | T2M-GPT (τ = 0) |
| Text-to-Motion Generation | HumanML3D | Diversity | 9.722 | T2M-GPT (τ ∈ U[0, 1]) |
| Text-to-Motion Generation | HumanML3D | FID | 0.141 | T2M-GPT (τ ∈ U[0, 1]) |
| Text-to-Motion Generation | HumanML3D | Multimodality | 1.831 | T2M-GPT (τ ∈ U[0, 1]) |
| Text-to-Motion Generation | HumanML3D | R-Precision (Top-3) | 0.775 | T2M-GPT (τ ∈ U[0, 1]) |
| Text-to-Motion Generation | Motion-X | Diversity | 10.753 | T2M-GPT |
| Text-to-Motion Generation | Motion-X | FID | 1.366 | T2M-GPT |
| Text-to-Motion Generation | Motion-X | Multimodality | 2.356 | T2M-GPT |
| Text-to-Motion Generation | Motion-X | TMR-Matching Score | 0.881 | T2M-GPT |
| Text-to-Motion Generation | Motion-X | TMR-R-Precision (Top-3) | 0.655 | T2M-GPT |
| Text-to-Motion Generation | KIT Motion-Language | Diversity | 10.921 | T2M-GPT (τ ∈ U[0, 1]) |
| Text-to-Motion Generation | KIT Motion-Language | FID | 0.514 | T2M-GPT (τ ∈ U[0, 1]) |
| Text-to-Motion Generation | KIT Motion-Language | Multimodality | 1.570 | T2M-GPT (τ ∈ U[0, 1]) |
| Text-to-Motion Generation | KIT Motion-Language | R-Precision (Top-3) | 0.745 | T2M-GPT (τ ∈ U[0, 1]) |
| Text-to-Motion Generation | KIT Motion-Language | Diversity | 10.862 | T2M-GPT (τ = 0.5) |
| Text-to-Motion Generation | KIT Motion-Language | FID | 0.717 | T2M-GPT (τ = 0.5) |
| Text-to-Motion Generation | KIT Motion-Language | Multimodality | 1.912 | T2M-GPT (τ = 0.5) |
| Text-to-Motion Generation | KIT Motion-Language | R-Precision (Top-3) | 0.737 | T2M-GPT (τ = 0.5) |
| Text-to-Motion Generation | KIT Motion-Language | Diversity | 11.198 | T2M-GPT (τ = 0) |
| Text-to-Motion Generation | KIT Motion-Language | FID | 0.737 | T2M-GPT (τ = 0) |
| Text-to-Motion Generation | KIT Motion-Language | Multimodality | 2.309 | T2M-GPT (τ = 0) |
| Text-to-Motion Generation | KIT Motion-Language | R-Precision (Top-3) | 0.716 | T2M-GPT (τ = 0) |