Zeyu Zhang, Akide Liu, Ian Reid, Richard Hartley, Bohan Zhuang, Hao Tang
Human motion generation is a significant pursuit in generative computer vision, yet achieving long-sequence and efficient motion generation remains challenging. Recent advances in state space models (SSMs), notably Mamba, have shown considerable promise in long-sequence modeling with an efficient hardware-aware design, which makes them a promising foundation on which to build a motion generation model. Nevertheless, adapting SSMs to motion generation faces hurdles due to the lack of a specialized architecture for modeling motion sequences. To address these challenges, we propose Motion Mamba, a simple and efficient approach that presents the pioneering motion generation model utilizing SSMs. Specifically, we design a Hierarchical Temporal Mamba (HTM) block that processes temporal data by ensembling varying numbers of isolated SSM modules across a symmetric U-Net architecture, aimed at preserving motion consistency between frames. We also design a Bidirectional Spatial Mamba (BSM) block that processes latent poses bidirectionally to improve the accuracy of motion generation within a temporal frame. Our proposed method achieves up to 50% FID improvement and is up to 4 times faster on the HumanML3D and KIT-ML datasets compared to the previous best diffusion-based method, demonstrating strong capabilities in high-quality long-sequence motion modeling and real-time human motion generation. See the project website: https://steve-zeyu-zhang.github.io/MotionMamba/
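To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a scalar linear state space recurrence as a stand-in for a Mamba module, sums a forward and a backward scan to illustrate bidirectional processing, and uses a hypothetical helper, `symmetric_scan_counts`, to show one way the number of scans per layer could vary symmetrically across a U-Net-like stack. All function names, the parameters `a`, `b`, `c`, and the particular schedule are illustrative assumptions.

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Scalar linear SSM recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    A toy stand-in for a selective SSM; the real Mamba block uses
    input-dependent parameters and a hardware-aware parallel scan.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs: List[float], a: float = 0.9) -> List[float]:
    """Sum of a forward scan and a backward scan over the same sequence."""
    forward = ssm_scan(xs, a)
    # Scan the reversed sequence, then flip the result back into original order.
    backward = ssm_scan(xs[::-1], a)[::-1]
    return [f + g for f, g in zip(forward, backward)]


def symmetric_scan_counts(depth: int) -> List[int]:
    """Hypothetical symmetric schedule of scans per layer (assumption, not from the paper).

    Counts depend only on the distance from the middle layer, so the
    schedule mirrors itself across the encoder and decoder halves.
    """
    mid = (depth - 1) / 2
    return [int(abs(i - mid)) + 1 for i in range(depth)]


if __name__ == "__main__":
    print(bidirectional_scan([0.0, 1.0, 0.0], a=0.5))
    print(symmetric_scan_counts(5))
```

With an impulse in the middle of the sequence, the bidirectional output is symmetric around that position, whereas a forward-only scan would carry information in one direction only.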
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Motion Synthesis | HumanML3D | Diversity | 9.871 | Motion Mamba |
| Motion Synthesis | HumanML3D | FID | 0.281 | Motion Mamba |
| Motion Synthesis | HumanML3D | Multimodality | 2.294 | Motion Mamba |
| Motion Synthesis | HumanML3D | R Precision Top3 | 0.792 | Motion Mamba |
| Motion Synthesis | KIT Motion-Language | Diversity | 11.02 | Motion Mamba |
| Motion Synthesis | KIT Motion-Language | FID | 0.307 | Motion Mamba |
| Motion Synthesis | KIT Motion-Language | Multimodality | 1.678 | Motion Mamba |
| Motion Synthesis | KIT Motion-Language | R Precision Top3 | 0.765 | Motion Mamba |