Lei Jiang, Ye Wei, Hao Ni
Diffusion models have become a popular choice for human motion synthesis due to their powerful generative capabilities. However, their high computational complexity and large number of sampling steps pose challenges for real-time applications. Fortunately, the Consistency Model (CM) offers a solution, reducing the number of sampling steps from hundreds to a few, typically fewer than four, and thereby significantly accelerating sampling in diffusion models. However, applying it to text-conditioned human motion synthesis in latent space remains challenging. In this paper, we introduce \textbf{MotionPCM}, a phased-consistency-model-based approach designed to improve the quality and efficiency of real-time motion synthesis in latent space.
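The few-step sampling that CM-style models enable can be illustrated with a generic multi-step consistency sampling loop: each step makes one network call that maps the noisy latent directly to a clean estimate, then re-noises it to the next, lower noise level. This is a minimal sketch; `toy_f`, the noise schedule, and all names here are illustrative assumptions, not MotionPCM's actual architecture.

```python
import numpy as np

def consistency_multistep_sample(f, shape, timesteps, rng):
    """Generic multi-step consistency sampling: each step is a single
    network call mapping the noisy latent straight to a clean estimate,
    which is then re-noised to the next (lower) noise level."""
    x = timesteps[0] * rng.standard_normal(shape)   # start from pure noise at t_max
    x0 = x
    for i, t in enumerate(timesteps):
        x0 = f(x, t)                                # one call: predict the clean latent
        if i + 1 < len(timesteps):                  # re-inject noise for the next step
            x = x0 + timesteps[i + 1] * rng.standard_normal(shape)
    return x0

# Toy stand-in for a trained consistency network (hypothetical, for
# illustration only): it shrinks the input toward zero as noise level t grows.
toy_f = lambda x, t: x / (1.0 + t)

rng = np.random.default_rng(0)
sample = consistency_multistep_sample(
    toy_f, (1, 4), timesteps=[80.0, 20.0, 5.0, 1.0], rng=rng
)
print(sample.shape)  # (1, 4)
```

With four timesteps the loop makes exactly four network calls, which is the regime the abstract describes (fewer than four steps are possible by shortening `timesteps`).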
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Motion Synthesis | HumanML3D | R-Precision Top-3 | 0.842 | MotionPCM |
| Motion Synthesis | HumanML3D | FID | 0.03 | MotionPCM |
| Motion Synthesis | HumanML3D | Diversity | 9.575 | MotionPCM |
| Motion Synthesis | HumanML3D | MultiModality | 1.714 | MotionPCM |
| Motion Synthesis | KIT Motion-Language | R-Precision Top-3 | 0.787 | MotionPCM |
| Motion Synthesis | KIT Motion-Language | FID | 0.294 | MotionPCM |
| Motion Synthesis | KIT Motion-Language | Diversity | 10.827 | MotionPCM |
| Motion Synthesis | KIT Motion-Language | MultiModality | 1.254 | MotionPCM |