Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu
Human motion is inherently continuous and dynamic, posing significant challenges for generative models. While discrete generation methods are widely used, they suffer from limited expressiveness and frame-wise noise artifacts. In contrast, continuous approaches produce smoother, more natural motion but often struggle to adhere to conditioning signals due to high-dimensional complexity and limited training data. To resolve this discord between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that leverages rectified flow to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and achieves smoother, more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals across diverse settings. In our evaluations, DisCoRD (+MoMask) reaches an FID of 0.032 on HumanML3D and 0.169 on KIT Motion-Language. Our project page is available at: https://whwjdqls.github.io/discord.github.io/.
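
To make the idea of decoding tokens as conditional generation concrete, the following is a minimal NumPy sketch of rectified-flow decoding, not the authors' implementation. All names (`tokens_to_condition`, `training_pair`, `toy_velocity`, `decode`), the codebook, and the shapes are hypothetical. In particular, `toy_velocity` is a closed-form stand-in for a trained velocity network, and it assumes the condition has the same shape as the motion so that the result can be checked exactly.

```python
import numpy as np


def tokens_to_condition(tokens, codebook, frames_per_token):
    """Look up each discrete token and repeat its embedding to frame rate."""
    return np.repeat(codebook[tokens], frames_per_token, axis=0)


def training_pair(x1, rng):
    """Build one rectified-flow training example for a clean motion x1."""
    x0 = rng.standard_normal(x1.shape)  # noise endpoint of the path
    t = rng.uniform()                   # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight path x0 -> x1
    v_target = x1 - x0                  # constant velocity along that path
    return x_t, t, v_target


def toy_velocity(x, t, cond):
    """Stand-in for a learned v(x, t | cond): aims straight at the condition.

    Assumption for illustration only: the condition is itself the target
    motion. A real decoder would be a neural network trained on the
    (x_t, t, v_target) pairs produced by training_pair.
    """
    return (cond - x) / (1.0 - t)


def decode(tokens, codebook, frames_per_token, steps, rng, velocity=toy_velocity):
    """Integrate dx/dt = v(x, t | cond) from noise (t=0) towards t=1 with Euler steps."""
    cond = tokens_to_condition(tokens, codebook, frames_per_token)
    x = rng.standard_normal(cond.shape)  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt, cond)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((8, 4))  # 8 hypothetical codes, 4-dim frames
    motion = decode(np.array([3, 1, 5]), codebook, frames_per_token=4, steps=10, rng=rng)
    print(motion.shape)
```

Because the toy velocity always points at the condition, the Euler loop lands on the upsampled token embeddings at the final step. With a trained network in its place, the same loop would instead produce a continuous motion sample conditioned on the tokens.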
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Pose Tracking | HumanML3D | FID | 0.032 | DisCoRD (+MoMask) |
| Pose Tracking | HumanML3D | Multimodality | 1.288 | DisCoRD (+MoMask) |
| Pose Tracking | HumanML3D | R Precision Top3 | 0.809 | DisCoRD (+MoMask) |
| Pose Tracking | KIT Motion-Language | FID | 0.169 | DisCoRD (+MoMask) |
| Pose Tracking | KIT Motion-Language | Multimodality | 1.266 | DisCoRD (+MoMask) |
| Pose Tracking | KIT Motion-Language | R Precision Top3 | 0.775 | DisCoRD (+MoMask) |
| Motion Synthesis | HumanML3D | FID | 0.032 | DisCoRD (+MoMask) |
| Motion Synthesis | HumanML3D | Multimodality | 1.288 | DisCoRD (+MoMask) |
| Motion Synthesis | HumanML3D | R Precision Top3 | 0.809 | DisCoRD (+MoMask) |
| Motion Synthesis | KIT Motion-Language | FID | 0.169 | DisCoRD (+MoMask) |
| Motion Synthesis | KIT Motion-Language | Multimodality | 1.266 | DisCoRD (+MoMask) |
| Motion Synthesis | KIT Motion-Language | R Precision Top3 | 0.775 | DisCoRD (+MoMask) |
| 10-shot image generation | HumanML3D | FID | 0.032 | DisCoRD (+MoMask) |
| 10-shot image generation | HumanML3D | Multimodality | 1.288 | DisCoRD (+MoMask) |
| 10-shot image generation | HumanML3D | R Precision Top3 | 0.809 | DisCoRD (+MoMask) |
| 10-shot image generation | KIT Motion-Language | FID | 0.169 | DisCoRD (+MoMask) |
| 10-shot image generation | KIT Motion-Language | Multimodality | 1.266 | DisCoRD (+MoMask) |
| 10-shot image generation | KIT Motion-Language | R Precision Top3 | 0.775 | DisCoRD (+MoMask) |
| 3D Human Pose Tracking | HumanML3D | FID | 0.032 | DisCoRD (+MoMask) |
| 3D Human Pose Tracking | HumanML3D | Multimodality | 1.288 | DisCoRD (+MoMask) |
| 3D Human Pose Tracking | HumanML3D | R Precision Top3 | 0.809 | DisCoRD (+MoMask) |
| 3D Human Pose Tracking | KIT Motion-Language | FID | 0.169 | DisCoRD (+MoMask) |
| 3D Human Pose Tracking | KIT Motion-Language | Multimodality | 1.266 | DisCoRD (+MoMask) |
| 3D Human Pose Tracking | KIT Motion-Language | R Precision Top3 | 0.775 | DisCoRD (+MoMask) |