Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Executing your Commands via Motion Diffusion in Latent Space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, Jingyi Yu, Gang Yu

2022-12-08 · CVPR 2023 · Motion Generation · Motion Synthesis

Paper · PDF · Code (official)

Abstract

We study a challenging task, conditional human motion generation, which produces plausible human motion sequences from various conditional inputs, such as action classes or textual descriptors. Because human motions are highly diverse and follow a distribution quite different from that of conditional modalities such as natural-language descriptions, it is hard to learn a probabilistic mapping from the desired conditional modality to human motion sequences. Moreover, raw motion data from a motion capture system can be redundant along the sequence and contain noise; directly modeling the joint distribution over raw motion sequences and conditional modalities would incur heavy computational overhead and might introduce artifacts from the capture noise. To learn a better representation of the various human motion sequences, we first design a powerful Variational AutoEncoder (VAE) that yields a representative, low-dimensional latent code for a human motion sequence. Then, instead of using a diffusion model to connect the raw motion sequences with the conditional inputs, we perform the diffusion process in the motion latent space. Our proposed Motion Latent-based Diffusion model (MLD) produces vivid motion sequences conforming to the given conditional inputs and substantially reduces computational overhead in both the training and inference stages. Extensive experiments on various human motion generation tasks demonstrate that MLD achieves significant improvements over state-of-the-art methods, while being two orders of magnitude faster than previous diffusion models that operate on raw motion sequences.
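The pipeline the abstract describes (a motion VAE, then diffusion run entirely in its latent space, with only a final decode back to raw motion) can be sketched as follows. The linear encoder/decoder, the dimensions, the noise schedule, and the zero-output denoiser are illustrative stand-ins, not the paper's actual transformer-based architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the motion VAE; MLD uses a learned transformer
# encoder/decoder, so these fixed linear maps only illustrate the interface.
MOTION_DIM, LATENT_DIM = 64, 8          # assumed sizes, for illustration
W_enc = rng.normal(size=(MOTION_DIM, LATENT_DIM)) / np.sqrt(MOTION_DIM)
W_dec = rng.normal(size=(LATENT_DIM, MOTION_DIM)) / np.sqrt(LATENT_DIM)

def encode(motion):                      # motion features -> latent code
    return motion @ W_enc

def decode(latent):                      # latent code -> motion features
    return latent @ W_dec

# Linear beta schedule for a DDPM-style diffusion in the latent space.
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(z0, t, eps):
    """Training-time forward process: noise a clean latent z0 to step t."""
    return np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def denoiser(z_t, t, cond):
    """Placeholder for the learned conditional noise predictor."""
    return np.zeros_like(z_t)            # a trained network would go here

def sample(cond, shape=(1, LATENT_DIM)):
    """Reverse process: start from noise, denoise in latent space, decode once."""
    z = rng.normal(size=shape)
    for t in reversed(range(T)):
        eps_hat = denoiser(z, t, cond)
        alpha_t = 1.0 - betas[t]
        z = (z - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_hat) / np.sqrt(alpha_t)
        if t > 0:                        # no noise is added at the final step
            z += np.sqrt(betas[t]) * rng.normal(size=shape)
    return decode(z)                     # only this step touches raw motion space

motion = sample(cond="a person walks forward")
print(motion.shape)
```

The key property this sketch illustrates is that every denoising step operates on the 8-dimensional latent rather than the full motion sequence, which is where the claimed training and inference savings come from.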

Results

| Task             | Dataset             | Metric                | Value | Model |
|------------------|---------------------|-----------------------|-------|-------|
| Motion Synthesis | HumanML3D           | Diversity             | 9.724 | MLD   |
| Motion Synthesis | HumanML3D           | FID                   | 0.473 | MLD   |
| Motion Synthesis | HumanML3D           | Multimodality         | 2.413 | MLD   |
| Motion Synthesis | HumanML3D           | R-Precision Top-3     | 0.772 | MLD   |
| Motion Synthesis | Motion-X            | Diversity             | 10.42 | MLD   |
| Motion Synthesis | Motion-X            | FID                   | 3.407 | MLD   |
| Motion Synthesis | Motion-X            | MModality             | 2.448 | MLD   |
| Motion Synthesis | Motion-X            | TMR-Matching Score    | 0.883 | MLD   |
| Motion Synthesis | Motion-X            | TMR-R-Precision Top-3 | 0.683 | MLD   |
| Motion Synthesis | HumanAct12          | Accuracy              | 0.964 | MLD   |
| Motion Synthesis | HumanAct12          | FID                   | 0.077 | MLD   |
| Motion Synthesis | HumanAct12          | Multimodality         | 2.824 | MLD   |
| Motion Synthesis | KIT Motion-Language | Diversity             | 10.8  | MLD   |
| Motion Synthesis | KIT Motion-Language | FID                   | 0.404 | MLD   |
| Motion Synthesis | KIT Motion-Language | Multimodality         | 2.192 | MLD   |
| Motion Synthesis | KIT Motion-Language | R-Precision Top-3     | 0.734 | MLD   |
| Motion Synthesis | KIT Motion-Language | Diversity             | 10.84 | TEMOS |
| Motion Synthesis | KIT Motion-Language | FID                   | 3.717 | TEMOS |
| Motion Synthesis | KIT Motion-Language | Multimodality         | 0.532 | TEMOS |
| Motion Synthesis | KIT Motion-Language | R-Precision Top-3     | 0.687 | TEMOS |
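For reference, the Diversity and Multimodality (MModality) numbers above are conventionally computed as mean pairwise Euclidean distances between motion feature vectors: across generations for different conditions (Diversity), or across repeated generations of the same condition (Multimodality). A minimal sketch on random stand-in features; the feature extractor itself (a pretrained motion encoder in these benchmarks) and the array sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def diversity(features, n_pairs=100):
    """Mean Euclidean distance between randomly paired generated samples.
    `features` has shape (num_samples, feat_dim)."""
    a = rng.choice(len(features), size=n_pairs)
    b = rng.choice(len(features), size=n_pairs)
    return np.linalg.norm(features[a] - features[b], axis=1).mean()

def multimodality(features_per_cond, n_pairs=10):
    """Diversity restricted to generations of the same condition, averaged
    over conditions. `features_per_cond` has shape
    (num_conditions, reps_per_condition, feat_dim)."""
    return float(np.mean([diversity(f, n_pairs) for f in features_per_cond]))

# Stand-in features: 200 generated motions embedded into 512-dim vectors.
feats = rng.normal(size=(200, 512))
print(round(diversity(feats), 3))
```

Higher Diversity is better when it approaches the ground-truth value, while FID (not shown) compares Gaussian fits of generated and real feature distributions, so lower is better.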

Related Papers

SnapMoGen: Human Motion Generation from Expressive Texts (2025-07-12)
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data (2025-07-09)
Motion Generation: A Survey of Generative Approaches and Benchmarks (2025-07-07)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation (2025-07-01)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling (2025-06-23)
PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis (2025-06-22)