Mathis Petrovich, Michael J. Black, Gül Varol
We address the problem of generating diverse 3D human motions from textual descriptions. This challenging task requires joint modeling of both modalities: understanding and extracting useful human-centric information from the text, and then generating plausible and realistic sequences of human poses. In contrast to most previous work, which focuses on generating a single, deterministic motion from a textual description, we design a variational approach that can produce multiple diverse human motions. We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space. We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, as well as more expressive SMPL body motions. We evaluate our approach on the KIT Motion-Language benchmark and, despite being relatively straightforward, demonstrate significant improvements over the state of the art. Code and models are available on our webpage.
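The variational idea in the abstract can be illustrated with a minimal sketch. It assumes diagonal Gaussian latents and uses a KL divergence as one common way to measure how well the text distribution matches the motion distribution; the abstract does not spell out the exact losses, so this is an illustration and not the released TEMOS code. All names (`sample_latent`, `kl_diag_gaussians`) and the encoder outputs are hypothetical.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)) for diagonal Gaussians."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        total += 0.5 * (lq - lp
                        + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq)
                        - 1.0)
    return total


# Hypothetical distribution parameters, standing in for the outputs of a
# text encoder and a motion encoder on a matching text/motion pair.
text_mu, text_logvar = [0.2, -0.1, 0.5], [-1.0, -0.5, 0.0]
motion_mu, motion_logvar = [0.1, 0.0, 0.4], [-0.8, -0.6, 0.1]

# Several latent samples from the same text distribution; a motion decoder
# (not shown) would map each sample to a different motion.
rng = random.Random(0)
samples = [sample_latent(text_mu, text_logvar, rng) for _ in range(3)]

# One possible measure of how compatible the two distributions are.
alignment_loss = kl_diag_gaussians(text_mu, text_logvar,
                                   motion_mu, motion_logvar)
```

Drawing several latent vectors from one text-conditioned distribution is what allows a single description to yield multiple distinct motions, in contrast to a deterministic text-to-motion mapping.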
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Motion Synthesis | Inter-X | FID | 29.258 | TEMOS |
| Motion Synthesis | Inter-X | MMDist | 6.867 | TEMOS |
| Motion Synthesis | Inter-X | MModality | 0.672 | TEMOS |
| Motion Synthesis | Inter-X | R-Precision Top3 | 0.238 | TEMOS |
| Motion Synthesis | InterHuman | FID | 17.375 | TEMOS |
| Motion Synthesis | InterHuman | MMDist | 6.342 | TEMOS |
| Motion Synthesis | InterHuman | MModality | 0.535 | TEMOS |
| Motion Synthesis | InterHuman | R-Precision Top3 | 0.45 | TEMOS |