
Motion Anything: Any to Motion Generation

Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley

2025-03-10 · Motion Generation · Motion Synthesis
Paper · PDF · Code (official)

Abstract

Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Motion-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. See our project website: https://steve-zeyu-zhang.github.io/MotionAnything
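The Attention-based Mask Modeling described above is not specified in detail on this page, but the core idea of masking the condition-relevant frames first, rather than masking uniformly at random, can be sketched in a few lines. The following is a minimal illustration, assuming generic per-frame motion embeddings and condition embeddings; the function name, the saliency heuristic, and the mask ratio are all hypothetical and not taken from the paper.

```python
import torch

def select_mask_indices(motion_tokens, cond_tokens, mask_ratio=0.4):
    """Rank motion frames by cross-attention to the condition and mask
    the most condition-relevant ones (illustrative sketch, not the
    paper's exact formulation).

    motion_tokens: (T, D) per-frame motion embeddings
    cond_tokens:   (L, D) condition embeddings (e.g. text or music)
    Returns a boolean mask of shape (T,); True marks frames to mask.
    """
    T, D = motion_tokens.shape
    # Attention from each condition token to each frame (rows sum to 1).
    attn = torch.softmax(cond_tokens @ motion_tokens.T / D**0.5, dim=-1)  # (L, T)
    # Per-frame saliency: average attention received across condition tokens.
    saliency = attn.mean(dim=0)  # (T,)
    num_mask = max(1, int(mask_ratio * T))
    mask = torch.zeros(T, dtype=torch.bool)
    mask[saliency.topk(num_mask).indices] = True
    return mask

# Toy usage: 60 frames of 256-d motion tokens, 8 text tokens.
mask = select_mask_indices(torch.randn(60, 256), torch.randn(8, 256))
print(mask.sum())  # tensor(24): 40% of the 60 frames selected for masking
```

Masking the frames the condition attends to most forces the model to reconstruct exactly the content the condition specifies, which is the intuition behind prioritizing dynamic frames and body parts over uniform random masking.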

Results

Task | Dataset | Metric | Value | Model
Motion Synthesis | HumanML3D | Diversity | 9.521 | Motion Anything
Motion Synthesis | HumanML3D | FID | 0.028 | Motion Anything
Motion Synthesis | HumanML3D | Multimodality | 2.705 | Motion Anything
Motion Synthesis | HumanML3D | R-Precision (Top-3) | 0.829 | Motion Anything
Motion Synthesis | TMD | Beat Alignment Score (BAS) | 0.2094 | Motion Anything
Motion Synthesis | TMD | FID | 21.46 | Motion Anything
Motion Synthesis | TMD | MM-Dist | 5.34 | Motion Anything
Motion Synthesis | TMD | MModality | 2.424 | Motion Anything
Motion Synthesis | KIT Motion-Language | Diversity | 10.94 | Motion Anything
Motion Synthesis | KIT Motion-Language | FID | 0.131 | Motion Anything
Motion Synthesis | KIT Motion-Language | Multimodality | 1.374 | Motion Anything
Motion Synthesis | KIT Motion-Language | R-Precision (Top-3) | 0.802 | Motion Anything
Motion Synthesis | AIST++ | Beat Alignment Score (BAS) | 0.2757 | Motion Anything
Motion Synthesis | AIST++ | FID | 17.22 | Motion Anything
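In this table, lower FID is better (generated motion statistics closer to real motion), while Diversity, Multimodality, R-Precision, and beat alignment capture overall variation, per-condition variation, text-motion retrieval accuracy, and music synchrony, respectively. For reference, FID values on these benchmarks are typically computed as a Fréchet distance between Gaussian fits of features from a pretrained, benchmark-specific motion encoder; the sketch below shows that standard computation and is not code from this paper (the feature extractor is assumed to be given).

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussian fits of two feature sets.

    feats_real, feats_gen: (N, D) arrays of features from a pretrained
    motion encoder (benchmark-specific; assumed available).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; numerical error can
    # introduce tiny imaginary components, so keep only the real part.
    covmean = linalg.sqrtm(cov_r @ cov_g).real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```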

Related Papers

SnapMoGen: Human Motion Generation from Expressive Texts (2025-07-12)
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data (2025-07-09)
Motion Generation: A Survey of Generative Approaches and Benchmarks (2025-07-07)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation (2025-07-01)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling (2025-06-23)
PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis (2025-06-22)