Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration

Kehong Gong, Dongze Lian, Heng Chang, Chuan Guo, Zihang Jiang, Xinxin Zuo, Michael Bi Mi, Xinchao Wang

2023-04-05 · ICCV 2023 · Tasks: Motion Prediction, Motion Generation, Motion Synthesis

Paper · PDF · Code (official)

Abstract

We propose a novel task for generating 3D dance movements that simultaneously incorporate both text and music modalities. Unlike existing works that generate dance movements using a single modality such as music, our goal is to produce richer dance movements guided by the instructive information provided by the text. However, the lack of paired motion data with both music and text modalities limits the ability to generate dance movements that integrate both. To alleviate this challenge, we propose to utilize a 3D human motion VQ-VAE to project the motions of the two datasets into a latent space consisting of quantized vectors, which effectively mixes the motion tokens from the two datasets with different distributions for training. Additionally, we propose a cross-modal transformer to integrate text instructions into the motion generation architecture, generating 3D dance movements without degrading the performance of music-conditioned dance generation. To better evaluate the quality of the generated motion, we introduce two novel metrics, namely Motion Prediction Distance (MPD) and Freezing Score (FS), to measure the coherence and freezing percentage of the generated motion. Extensive experiments show that our approach can generate realistic and coherent dance movements conditioned on both text and music while maintaining comparable performance with the two single modalities. Code is available at https://garfield-kh.github.io/TM2D/.
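The abstract's central trick — projecting motions from two differently distributed datasets into one latent space of quantized vectors so their tokens can be mixed — rests on the standard VQ-VAE bottleneck: each continuous frame feature is snapped to its nearest codebook entry. A minimal sketch of that quantization step (shapes, names, and the codebook itself are illustrative, not the paper's actual implementation):

```python
import numpy as np

def quantize(motion_features, codebook):
    """VQ-VAE-style bottleneck: map each frame's continuous feature
    vector to the index of its nearest codebook entry (L2 distance).

    motion_features: (T, D) array, one D-dim feature per frame
    codebook:        (K, D) array of learned code vectors
    """
    # (T, K) matrix of squared L2 distances from every frame to every code
    dists = ((motion_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = dists.argmin(axis=1)   # (T,) discrete motion tokens
    quantized = codebook[tokens]    # (T, D) snapped-to-code vectors
    return tokens, quantized
```

Because both datasets pass through the same codebook, their motions end up in one shared discrete vocabulary; downstream, a transformer can treat text-paired and music-paired motion tokens interchangeably.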

Results

Task              | Dataset   | Metric               | Value  | Model
Motion Synthesis  | HumanML3D | Diversity            | 9.513  | TM2D (t2m)
Motion Synthesis  | HumanML3D | FID                  | 1.021  | TM2D (t2m)
Motion Synthesis  | HumanML3D | Multimodality        | 4.139  | TM2D (t2m)
Motion Synthesis  | AIST++    | Beat alignment score | 0.2049 | TM2D
Motion Synthesis  | AIST++    | FID                  | 19.01  | TM2D
Motion Synthesis  | AIST++    | Beat alignment score | 0.2127 | TM2D (only motion data)
Motion Synthesis  | AIST++    | FID                  | 23.94  | TM2D (only motion data)
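Alongside the FID and Diversity numbers above, the abstract introduces a Freezing Score (FS), described only as measuring the freezing percentage of generated motion. One plausible reading — assumed here, not taken from the paper — is the fraction of frames whose mean joint velocity falls below a small threshold (the threshold value is a hypothetical parameter):

```python
import numpy as np

def freezing_score(joints, threshold=1e-3):
    """Hypothetical Freezing Score: percentage of frames where the
    motion is effectively static, i.e. mean per-joint displacement
    between consecutive frames falls below `threshold`.

    joints: (T, J, 3) array of 3D joint positions over T frames
    """
    # per-joint displacement magnitudes between consecutive frames: (T-1, J)
    step = np.linalg.norm(joints[1:] - joints[:-1], axis=-1)
    frame_vel = step.mean(axis=-1)            # (T-1,) mean velocity per frame
    return float((frame_vel < threshold).mean())
```

Under this reading, a fully static sequence scores 1.0 and a continuously moving one scores 0.0, so lower is better for dance generation.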

Related Papers

- SnapMoGen: Human Motion Generation from Expressive Texts (2025-07-12)
- Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data (2025-07-09)
- Motion Generation: A Survey of Generative Approaches and Benchmarks (2025-07-07)
- Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic (2025-07-05)
- Temporal Continual Learning with Prior Compensation for Human Motion Prediction (2025-07-05)
- DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
- A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation (2025-07-01)
- VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)