Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset

Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, Shunlin Lu, Yurong Fu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang

2025-01-09 Human Mesh Recovery, Motion Generation


Abstract

In this paper, we introduce Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets predominantly capture body-only poses and lack facial expressions, hand gestures, and fine-grained pose descriptions. They are also typically limited to lab settings with manually labeled text descriptions, which restricts their scalability. To address this issue, we develop a scalable annotation pipeline that automatically captures 3D whole-body human motion and comprehensive textual labels from RGB videos, and we build the Motion-X dataset comprising 81.1K text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the data quantity. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from a large number of scenes, 80.8K RGB videos, 45.3K audio clips, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. Comprehensive experiments validate the accuracy of our annotation pipeline and highlight Motion-X++'s significant benefits for generating expressive, precise, and natural motion, with paired multimodal labels supporting several downstream tasks, including text-driven whole-body motion generation, audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoint estimation.
