Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, Amit H. Bermano
Natural and expressive human motion generation is the holy grail of computer animation. It is a challenging task, due to the diversity of possible motion, human perceptual sensitivity to it, and the difficulty of accurately describing it. Therefore, current generative solutions are either low-quality or limited in expressiveness. Diffusion models, which have already shown remarkable generative capabilities in other domains, are promising candidates for human motion due to their many-to-many nature, but they tend to be resource-hungry and hard to control. In this paper, we introduce Motion Diffusion Model (MDM), a carefully adapted classifier-free diffusion-based generative model for the human motion domain. MDM is transformer-based, combining insights from the motion generation literature. A notable design choice is the prediction of the sample, rather than the noise, in each diffusion step. This facilitates the use of established geometric losses on the locations and velocities of the motion, such as the foot contact loss. As we demonstrate, MDM is a generic approach, enabling different modes of conditioning and different generation tasks. We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion and action-to-motion. Project page: https://guytevet.github.io/mdm-page/
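The sample-prediction design described above can be illustrated with a small sketch. Because the network outputs an estimate of the clean motion rather than the noise, losses on positions and velocities can be applied directly to that output. The code below is a minimal pure-Python illustration, not the authors' implementation: the function names (`q_sample`, `sample_prediction_loss`, `cfg_combine`), the weight `lambda_vel`, and the representation of a motion as a list of per-frame feature lists are all assumptions made for this example. The guidance function uses the standard classifier-free formulation, which interpolates or extrapolates between an unconditioned and a conditioned prediction.

```python
import math
from typing import List

Motion = List[List[float]]  # assumed layout: frames x features


def q_sample(x0: Motion, alpha_bar: float, noise: Motion) -> Motion:
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [[a * v + b * n for v, n in zip(f, nf)] for f, nf in zip(x0, noise)]


def mse(a: Motion, b: Motion) -> float:
    """Mean squared error over all frames and features."""
    total, count = 0.0, 0
    for fa, fb in zip(a, b):
        for va, vb in zip(fa, fb):
            total += (va - vb) ** 2
            count += 1
    return total / count


def velocities(x: Motion) -> Motion:
    """Finite differences between consecutive frames."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(x, x[1:])]


def sample_prediction_loss(x0: Motion, x0_hat: Motion, lambda_vel: float = 1.0) -> float:
    """Reconstruction loss on the predicted sample plus a velocity term.

    Both terms compare against the clean motion x0, which is only possible
    because the model predicts the sample instead of the noise.
    """
    return mse(x0, x0_hat) + lambda_vel * mse(velocities(x0), velocities(x0_hat))


def cfg_combine(uncond: Motion, cond: Motion, scale: float) -> Motion:
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [[u + scale * (c - u) for u, c in zip(fu, fc)] for fu, fc in zip(uncond, cond)]
```

In this sketch, a prediction that is offset from the ground truth by a constant incurs the position penalty but no velocity penalty, since frame-to-frame differences are unchanged. A foot contact term would follow the same pattern, penalizing foot velocity on frames marked as in contact, but it needs joint positions and contact labels that are omitted here.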
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Motion Synthesis | HumanML3D | Diversity | 9.559 | MDM |
| Motion Synthesis | HumanML3D | FID | 0.544 | MDM |
| Motion Synthesis | HumanML3D | Multimodality | 2.799 | MDM |
| Motion Synthesis | HumanML3D | R Precision Top3 | 0.611 | MDM |
| Motion Synthesis | Inter-X | FID | 23.701 | MDM |
| Motion Synthesis | Inter-X | MMDist | 9.548 | MDM |
| Motion Synthesis | Inter-X | MModality | 3.49 | MDM |
| Motion Synthesis | Inter-X | R-Precision Top3 | 0.426 | MDM |
| Motion Synthesis | InterHuman | FID | 9.167 | MDM |
| Motion Synthesis | InterHuman | MMDist | 7.125 | MDM |
| Motion Synthesis | InterHuman | MModality | 2.35 | MDM |
| Motion Synthesis | InterHuman | R-Precision Top3 | 0.339 | MDM |
| Motion Synthesis | Motion-X | Diversity | 11.4 | MDM |
| Motion Synthesis | Motion-X | FID | 3.8 | MDM |
| Motion Synthesis | Motion-X | MModality | 2.53 | MDM |
| Motion Synthesis | Motion-X | TMR-Matching Score | 0.84 | MDM |
| Motion Synthesis | Motion-X | TMR-R-Precision Top3 | 0.6341 | MDM |
| Motion Synthesis | HumanAct12 | Accuracy | 0.99 | MDM |
| Motion Synthesis | HumanAct12 | FID | 0.08 | MDM |
| Motion Synthesis | HumanAct12 | Multimodality | 2.58 | MDM |
| Motion Synthesis | KIT Motion-Language | Diversity | 10.847 | MDM |
| Motion Synthesis | KIT Motion-Language | FID | 0.497 | MDM |
| Motion Synthesis | KIT Motion-Language | Multimodality | 1.907 | MDM |
| Motion Synthesis | KIT Motion-Language | R Precision Top3 | 0.396 | MDM |
| 3D Generation | E.T. the Exceptional Trajectories | ClaTr-Score | 18.32 | MDM |
| 3D Generation | E.T. the Exceptional Trajectories | Classifier-F1 | 0.34 | MDM |
| 3D Generation | E.T. the Exceptional Trajectories | FD_ClaTr | 6.79 | MDM |