Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Executing your Commands via Motion Diffusion in Latent Space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, Jingyi Yu, Gang Yu

2022-12-08 · CVPR 2023 · Motion Generation · Motion Synthesis

Paper · PDF · Code (official)

Abstract

We study a challenging task, conditional human motion generation, which produces plausible human motion sequences from various conditional inputs, such as action classes or textual descriptors. Because human motions are highly diverse and follow a distribution quite different from that of conditional modalities such as natural-language descriptions, it is hard to learn a probabilistic mapping from the desired conditional modality to human motion sequences. Moreover, raw motion data from a motion capture system can be redundant along the sequence and contain noise; directly modeling the joint distribution over raw motion sequences and conditional modalities would incur heavy computational overhead and might introduce artifacts from the capture noise. To learn a better representation of the various human motion sequences, we first design a powerful Variational AutoEncoder (VAE) that yields a representative, low-dimensional latent code for a human motion sequence. Then, instead of using a diffusion model to connect the raw motion sequences with the conditional inputs, we perform the diffusion process in the motion latent space. Our proposed Motion Latent-based Diffusion model (MLD) produces vivid motion sequences conforming to the given conditional inputs and substantially reduces computational overhead in both the training and inference stages. Extensive experiments on various human motion generation tasks demonstrate that MLD achieves significant improvements over state-of-the-art methods, while being two orders of magnitude faster than previous diffusion models that operate on raw motion sequences.
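The pipeline the abstract describes (a motion VAE, then diffusion run entirely in its latent space, with only a final decode back to raw motion) can be sketched as follows. The linear encoder/decoder, the dimensions, the noise schedule, and the zero-output denoiser are illustrative stand-ins, not the paper's actual transformer-based architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the motion VAE; MLD uses a learned transformer
# encoder/decoder, so these fixed linear maps only illustrate the interface.
MOTION_DIM, LATENT_DIM = 64, 8          # assumed sizes, for illustration
W_enc = rng.normal(size=(MOTION_DIM, LATENT_DIM)) / np.sqrt(MOTION_DIM)
W_dec = rng.normal(size=(LATENT_DIM, MOTION_DIM)) / np.sqrt(LATENT_DIM)

def encode(motion):                      # motion features -> latent code
    return motion @ W_enc

def decode(latent):                      # latent code -> motion features
    return latent @ W_dec

# Linear beta schedule for a DDPM-style diffusion in the latent space.
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(z0, t, eps):
    """Training-time forward process: noise a clean latent z0 to step t."""
    return np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def denoiser(z_t, t, cond):
    """Placeholder for the learned conditional noise predictor."""
    return np.zeros_like(z_t)            # a trained network would go here

def sample(cond, shape=(1, LATENT_DIM)):
    """Reverse process: start from noise, denoise in latent space, decode once."""
    z = rng.normal(size=shape)
    for t in reversed(range(T)):
        eps_hat = denoiser(z, t, cond)
        alpha_t = 1.0 - betas[t]
        z = (z - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_hat) / np.sqrt(alpha_t)
        if t > 0:                        # no noise is added at the final step
            z += np.sqrt(betas[t]) * rng.normal(size=shape)
    return decode(z)                     # only this step touches raw motion space

motion = sample(cond="a person walks forward")
print(motion.shape)
```

The key property this sketch illustrates is that every denoising step operates on the 8-dimensional latent rather than the full motion sequence, which is where the claimed training and inference savings come from.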

Results

| Task             | Dataset             | Metric                | Value | Model |
|------------------|---------------------|-----------------------|-------|-------|
| Motion Synthesis | HumanML3D           | Diversity             | 9.724 | MLD   |
| Motion Synthesis | HumanML3D           | FID                   | 0.473 | MLD   |
| Motion Synthesis | HumanML3D           | Multimodality         | 2.413 | MLD   |
| Motion Synthesis | HumanML3D           | R-Precision Top-3     | 0.772 | MLD   |
| Motion Synthesis | Motion-X            | Diversity             | 10.42 | MLD   |
| Motion Synthesis | Motion-X            | FID                   | 3.407 | MLD   |
| Motion Synthesis | Motion-X            | MModality             | 2.448 | MLD   |
| Motion Synthesis | Motion-X            | TMR-Matching Score    | 0.883 | MLD   |
| Motion Synthesis | Motion-X            | TMR-R-Precision Top-3 | 0.683 | MLD   |
| Motion Synthesis | HumanAct12          | Accuracy              | 0.964 | MLD   |
| Motion Synthesis | HumanAct12          | FID                   | 0.077 | MLD   |
| Motion Synthesis | HumanAct12          | Multimodality         | 2.824 | MLD   |
| Motion Synthesis | KIT Motion-Language | Diversity             | 10.8  | MLD   |
| Motion Synthesis | KIT Motion-Language | FID                   | 0.404 | MLD   |
| Motion Synthesis | KIT Motion-Language | Multimodality         | 2.192 | MLD   |
| Motion Synthesis | KIT Motion-Language | R-Precision Top-3     | 0.734 | MLD   |
| Motion Synthesis | KIT Motion-Language | Diversity             | 10.84 | TEMOS |
| Motion Synthesis | KIT Motion-Language | FID                   | 3.717 | TEMOS |
| Motion Synthesis | KIT Motion-Language | Multimodality         | 0.532 | TEMOS |
| Motion Synthesis | KIT Motion-Language | R-Precision Top-3     | 0.687 | TEMOS |
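For reference, the Diversity and Multimodality (MModality) numbers above are conventionally computed as mean pairwise Euclidean distances between motion feature vectors: across generations for different conditions (Diversity), or across repeated generations of the same condition (Multimodality). A minimal sketch on random stand-in features; the feature extractor itself (a pretrained motion encoder in these benchmarks) and the array sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def diversity(features, n_pairs=100):
    """Mean Euclidean distance between randomly paired generated samples.
    `features` has shape (num_samples, feat_dim)."""
    a = rng.choice(len(features), size=n_pairs)
    b = rng.choice(len(features), size=n_pairs)
    return np.linalg.norm(features[a] - features[b], axis=1).mean()

def multimodality(features_per_cond, n_pairs=10):
    """Diversity restricted to generations of the same condition, averaged
    over conditions. `features_per_cond` has shape
    (num_conditions, reps_per_condition, feat_dim)."""
    return float(np.mean([diversity(f, n_pairs) for f in features_per_cond]))

# Stand-in features: 200 generated motions embedded into 512-dim vectors.
feats = rng.normal(size=(200, 512))
print(round(diversity(feats), 3))
```

Higher Diversity is better when it approaches the ground-truth value, while FID (not shown) compares Gaussian fits of generated and real feature distributions, so lower is better.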

Related Papers

SnapMoGen: Human Motion Generation from Expressive Texts (2025-07-12)
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data (2025-07-09)
Motion Generation: A Survey of Generative Approaches and Benchmarks (2025-07-07)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation (2025-07-01)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling (2025-06-23)
PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis (2025-06-22)