Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion

Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, Yi Yang

2023-09-04 · Motion Generation · Motion Synthesis · Language Modelling

Paper · PDF

Abstract

We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite recent significant progress in text-based human motion generation, existing methods often prioritize fitting the training motions at the expense of action diversity, so striking a balance between motion quality and diversity remains an unresolved challenge. This problem is compounded by two key factors: 1) the lack of diversity in the motion-caption pairs of existing benchmarks, and 2) a unilateral, biased semantic understanding of the text prompt that focuses primarily on the verb while neglecting the nuanced distinctions indicated by other words. In response to the first issue, we construct a large-scale Wild Motion-Caption (WMC) dataset to extend the restricted action boundary of existing well-annotated datasets, enabling the learning of diverse motions from a more extensive range of actions. To this end, we train a motion BLIP on top of a pretrained vision-language model and use it to automatically generate diverse captions for the collected motion sequences, yielding a dataset of 8,888 motions paired with 141k texts. To comprehensively understand the text command, we propose a Hierarchical Semantic Aggregation (HSA) module that captures fine-grained semantics. Finally, we integrate these two designs into an effective Motion Discrete Diffusion (MDD) framework that balances motion quality and diversity. Extensive experiments on HumanML3D and KIT-ML show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity. The dataset, code, and pretrained models will be released to reproduce all of our results.
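The abstract names a Motion Discrete Diffusion (MDD) framework but does not specify it here. Mask-based discrete diffusion samplers generally start from a fully masked token sequence and iteratively commit tokens until the sequence is complete. The following is a minimal illustrative sketch under that assumption only: the function name, the denoiser signature, and the confidence-based commit schedule are hypothetical, not the paper's API.

```python
import numpy as np

def mdd_sample(denoiser, text_emb, seq_len, codebook_size, steps=10, rng=None):
    """Illustrative mask-based discrete diffusion sampling for motion tokens.

    Starts fully masked; each step the denoiser predicts per-position logits,
    the most confident positions are committed to their sampled tokens, and
    the rest stay masked for the next step. All names are hypothetical.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    MASK = codebook_size  # extra index reserved for the mask token
    tokens = np.full(seq_len, MASK, dtype=int)
    for t in range(steps):
        logits = denoiser(tokens, text_emb)            # (seq_len, codebook_size)
        logits = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)
        sampled = np.array([rng.choice(codebook_size, p=p) for p in probs])
        conf = probs[np.arange(seq_len), sampled]
        conf[tokens != MASK] = np.inf                  # committed tokens stay put
        n_keep = max(1, int(seq_len * (t + 1) / steps))  # linear unmask schedule
        keep = np.argsort(-conf)[:n_keep]
        new_tokens = np.full(seq_len, MASK, dtype=int)
        new_tokens[keep] = sampled[keep]
        committed = tokens != MASK
        new_tokens[committed] = tokens[committed]
        tokens = new_tokens
    return tokens
```

The sampled token indices would then be decoded to a pose sequence by a motion tokenizer (e.g. a VQ-VAE decoder), which this sketch omits.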

Results

Task | Dataset | Metric | Value | Model
Motion Synthesis | HumanML3D | Diversity | 9.551 | DiverseMotion (s=1)
Motion Synthesis | HumanML3D | FID | 0.07 | DiverseMotion (s=1)
Motion Synthesis | HumanML3D | Multimodality | 2.062 | DiverseMotion (s=1)
Motion Synthesis | HumanML3D | R-Precision Top-3 | 0.783 | DiverseMotion (s=1)
Motion Synthesis | HumanML3D | Diversity | 9.683 | DiverseMotion (s=2)
Motion Synthesis | HumanML3D | FID | 0.072 | DiverseMotion (s=2)
Motion Synthesis | HumanML3D | Multimodality | 1.869 | DiverseMotion (s=2)
Motion Synthesis | HumanML3D | R-Precision Top-3 | 0.802 | DiverseMotion (s=2)
Motion Synthesis | KIT Motion-Language | Diversity | 10.873 | DiverseMotion
Motion Synthesis | KIT Motion-Language | FID | 0.468 | DiverseMotion
Motion Synthesis | KIT Motion-Language | Multimodality | 2.062 | DiverseMotion
Motion Synthesis | KIT Motion-Language | R-Precision Top-3 | 0.76 | DiverseMotion

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing (2025-07-16)