
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MMM: Generative Masked Motion Model

Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, Chen Chen

2023-12-06 · CVPR 2024 · Motion Generation · Motion Synthesis
Paper · PDF · Code (official)

Abstract

Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at \url{https://exitudio.github.io/MMM-page}.
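The parallel, iterative decoding described above can be sketched in a few lines. This is a minimal illustration of MaskGIT-style confidence-based decoding, not the paper's actual implementation: the function and parameter names (`iterative_masked_decode`, `predict`, `mask_id`), the cosine masking schedule, and the toy predictor are all assumptions made for the sketch.

```python
import math
import random

def iterative_masked_decode(predict, seq_len, mask_id, num_steps=10):
    """Decode a motion-token sequence by iterative parallel unmasking.

    `predict(tokens)` stands in for the conditional masked motion
    transformer: given the current (partially masked) token sequence, it
    returns one (best_token, confidence) pair per position. Starting from
    a fully masked sequence, each step keeps the most confident
    predictions and re-masks the rest, following a cosine schedule so
    that fewer tokens remain masked as decoding proceeds.
    """
    tokens = [mask_id] * seq_len
    for step in range(num_steps):
        preds = predict(tokens)  # [(token, confidence), ...] per position
        # Cosine schedule: fraction of positions still masked after this step
        # (reaches 0 on the final step, so the output contains no masks).
        ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_masked = int(ratio * seq_len)
        # Accept every prediction, then re-mask the least confident ones.
        order = sorted(range(seq_len), key=lambda i: preds[i][1])
        tokens = [preds[i][0] for i in range(seq_len)]
        for i in order[:num_masked]:
            tokens[i] = mask_id
    return tokens

def toy_predict(tokens):
    # Stand-in predictor with a hypothetical 512-entry codebook:
    # random tokens and confidences, just to exercise the loop.
    return [(random.randrange(512), random.random()) for _ in tokens]

motion_tokens = iterative_masked_decode(toy_predict, seq_len=16,
                                        mask_id=512, num_steps=8)
```

The editing behavior the abstract describes fits the same loop: for motion in-betweening, initialize `tokens` with the known prefix and suffix tokens fixed and `mask_id` only in the region to be edited, and exempt the fixed positions from re-masking.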

Results

Task: Motion Synthesis

Dataset                Metric            MMM (predict length)   MMM (gt length)
HumanML3D              FID               0.08                   0.089
HumanML3D              R Precision Top3  0.794                  0.804
HumanML3D              Diversity         9.411                  9.577
HumanML3D              Multimodality     1.164                  1.226
KIT Motion-Language    FID               0.429                  0.316
KIT Motion-Language    R Precision Top3  0.718                  0.744
KIT Motion-Language    Diversity         10.633                 10.91
KIT Motion-Language    Multimodality     1.105                  1.232

Related Papers

SnapMoGen: Human Motion Generation from Expressive Texts (2025-07-12)
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data (2025-07-09)
Motion Generation: A Survey of Generative Approaches and Benchmarks (2025-07-07)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation (2025-07-01)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling (2025-06-23)
PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis (2025-06-22)