MoMask: Generative Masked Modeling of 3D Human Motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, Li Cheng

Date: 2023-11-29
Venue: CVPR 2024
Tasks: Human motion prediction, Motion Forecasting, Motion Generation, Motion Synthesis, Motion Interpolation

Abstract

We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, the residual tokens of increasing orders are derived and stored at the subsequent layers of the hierarchy. This is followed by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is designated to predict randomly masked motion tokens conditioned on text input at the training stage. During the generation (i.e. inference) stage, starting from an empty sequence, our Masked Transformer iteratively fills in the missing tokens. Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from the current layer. Extensive experiments demonstrate that MoMask outperforms the state-of-the-art methods on the text-to-motion generation task, with an FID of 0.045 (vs. e.g. 0.141 for T2M-GPT) on the HumanML3D dataset, and 0.228 (vs. 0.514) on KIT-ML. MoMask can also be seamlessly applied to related tasks, such as text-guided temporal inpainting, without further model fine-tuning.
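The hierarchical quantization described in the abstract can be illustrated with a small residual quantization sketch. This is not the authors' implementation: the codebooks here are random rather than learned, the dimensions are toy values, and the names `nearest_code`, `residual_quantize`, and `reconstruct` are hypothetical. Each codebook also contains a zero vector, an assumption made only so that the toy reconstruction error cannot grow from one layer to the next.

```python
import random

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_quantize(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        # What this layer failed to capture is passed on to the next layer.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def reconstruct(tokens, codebooks, num_layers=None):
    """Sum the selected code vectors from the first num_layers layers."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in list(zip(tokens, codebooks))[:num_layers]:
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy setup (all values hypothetical): 3 layers, 8 codes each, 4-dim features.
# Code magnitudes shrink with depth; entry 0 of every codebook is the zero vector.
rng = random.Random(0)
DIM, CODES, LAYERS = 4, 8, 3
codebooks = [
    [[0.0] * DIM]
    + [[rng.uniform(-1, 1) * 0.5 ** layer for _ in range(DIM)]
       for _ in range(CODES - 1)]
    for layer in range(LAYERS)
]

frame = [rng.uniform(-1, 1) for _ in range(DIM)]  # stand-in for one motion feature
tokens = residual_quantize(frame, codebooks)      # one token per layer
errors = [squared_error(frame, reconstruct(tokens, codebooks, k))
          for k in range(1, LAYERS + 1)]
print(tokens, errors)
```

In this sketch the first entry of `tokens` plays the role of a base-layer token and the remaining entries play the role of residual tokens; `errors` shows how the reconstruction changes as more layers are summed.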

Results

Task	Dataset	Metric	Value	Model
Motion Synthesis	HumanML3D	FID	0.045	MoMask
Motion Synthesis	HumanML3D	Multimodality	1.241	MoMask
Motion Synthesis	HumanML3D	R Precision Top3	0.807	MoMask
Motion Synthesis	KIT Motion-Language	FID	0.204	MoMask
Motion Synthesis	KIT Motion-Language	Multimodality	1.131	MoMask
Motion Synthesis	KIT Motion-Language	R Precision Top3	0.781	MoMask
