Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration

Kehong Gong, Dongze Lian, Heng Chang, Chuan Guo, Zihang Jiang, Xinxin Zuo, Michael Bi Mi, Xinchao Wang

2023-04-05 · ICCV 2023 · Tasks: Motion Prediction, Motion Generation, Motion Synthesis

Paper · PDF · Code (official)

Abstract

We propose a novel task for generating 3D dance movements that simultaneously incorporate both text and music modalities. Unlike existing works that generate dance movements using a single modality such as music, our goal is to produce richer dance movements guided by the instructive information provided by the text. However, the lack of paired motion data with both music and text modalities limits the ability to generate dance movements that integrate both. To alleviate this challenge, we propose to utilize a 3D human motion VQ-VAE to project the motions of the two datasets into a latent space consisting of quantized vectors, which effectively mixes the motion tokens from the two datasets with different distributions for training. Additionally, we propose a cross-modal transformer to integrate text instructions into the motion generation architecture, generating 3D dance movements without degrading the performance of music-conditioned dance generation. To better evaluate the quality of the generated motion, we introduce two novel metrics, namely Motion Prediction Distance (MPD) and Freezing Score (FS), to measure the coherence and freezing percentage of the generated motion. Extensive experiments show that our approach can generate realistic and coherent dance movements conditioned on both text and music while maintaining comparable performance with the two single modalities. Code is available at https://garfield-kh.github.io/TM2D/.
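The abstract's central trick — projecting motions from two differently distributed datasets into one latent space of quantized vectors so their tokens can be mixed — rests on the standard VQ-VAE bottleneck: each continuous frame feature is snapped to its nearest codebook entry. A minimal sketch of that quantization step (shapes, names, and the codebook itself are illustrative, not the paper's actual implementation):

```python
import numpy as np

def quantize(motion_features, codebook):
    """VQ-VAE-style bottleneck: map each frame's continuous feature
    vector to the index of its nearest codebook entry (L2 distance).

    motion_features: (T, D) array, one D-dim feature per frame
    codebook:        (K, D) array of learned code vectors
    """
    # (T, K) matrix of squared L2 distances from every frame to every code
    dists = ((motion_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = dists.argmin(axis=1)   # (T,) discrete motion tokens
    quantized = codebook[tokens]    # (T, D) snapped-to-code vectors
    return tokens, quantized
```

Because both datasets pass through the same codebook, their motions end up in one shared discrete vocabulary; downstream, a transformer can treat text-paired and music-paired motion tokens interchangeably.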

Results

Task              | Dataset   | Metric               | Value  | Model
Motion Synthesis  | HumanML3D | Diversity            | 9.513  | TM2D (t2m)
Motion Synthesis  | HumanML3D | FID                  | 1.021  | TM2D (t2m)
Motion Synthesis  | HumanML3D | Multimodality        | 4.139  | TM2D (t2m)
Motion Synthesis  | AIST++    | Beat alignment score | 0.2049 | TM2D
Motion Synthesis  | AIST++    | FID                  | 19.01  | TM2D
Motion Synthesis  | AIST++    | Beat alignment score | 0.2127 | TM2D (only motion data)
Motion Synthesis  | AIST++    | FID                  | 23.94  | TM2D (only motion data)
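Alongside the FID and Diversity numbers above, the abstract introduces a Freezing Score (FS), described only as measuring the freezing percentage of generated motion. One plausible reading — assumed here, not taken from the paper — is the fraction of frames whose mean joint velocity falls below a small threshold (the threshold value is a hypothetical parameter):

```python
import numpy as np

def freezing_score(joints, threshold=1e-3):
    """Hypothetical Freezing Score: percentage of frames where the
    motion is effectively static, i.e. mean per-joint displacement
    between consecutive frames falls below `threshold`.

    joints: (T, J, 3) array of 3D joint positions over T frames
    """
    # per-joint displacement magnitudes between consecutive frames: (T-1, J)
    step = np.linalg.norm(joints[1:] - joints[:-1], axis=-1)
    frame_vel = step.mean(axis=-1)            # (T-1,) mean velocity per frame
    return float((frame_vel < threshold).mean())
```

Under this reading, a fully static sequence scores 1.0 and a continuously moving one scores 0.0, so lower is better for dance generation.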

Related Papers

- SnapMoGen: Human Motion Generation from Expressive Texts (2025-07-12)
- Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data (2025-07-09)
- Motion Generation: A Survey of Generative Approaches and Benchmarks (2025-07-07)
- Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic (2025-07-05)
- Temporal Continual Learning with Prior Compensation for Human Motion Prediction (2025-07-05)
- DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
- A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation (2025-07-01)
- VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)