Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts

Chuan Guo, Xinxin Zuo, Sen Wang, Li Cheng

2022-07-04 · Machine Translation · NMT · Motion Captioning · Motion Synthesis
Paper · PDF · Code (official)

Abstract

Inspired by the strong ties between vision and language, two intimate human sensing and communication modalities, our paper aims to explore the generation of 3D human full-body motions from texts, as well as its reciprocal task, shorthanded as text2motion and motion2text, respectively. To tackle the existing challenges, especially to enable the generation of multiple distinct motions from the same text and to avoid the undesirable production of trivial motionless pose sequences, we propose the use of motion tokens, a discrete and compact motion representation. This provides a level playing field for the two modalities, which are represented as motion tokens and text tokens, respectively. Moreover, our motion2text module is integrated into the inverse alignment process of our text2motion training pipeline, where a significant deviation of the synthesized text from the input text is penalized by a large training loss; empirically, this is shown to improve performance effectively. Finally, the mappings between the two modalities of motions and texts are facilitated by adapting a neural machine translation (NMT) model to our context. This autoregressive modeling of the distribution over discrete motion tokens further enables non-deterministic production of pose sequences of variable lengths from an input text. Our approach is flexible and can be used for both the text2motion and motion2text tasks. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach on both tasks over a variety of state-of-the-art methods. Project page: https://ericguo5513.github.io/TM2T/
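The "motion token" idea in the abstract amounts to quantizing continuous per-frame pose features against a learned codebook, so a motion clip becomes a short sequence of discrete indices. A minimal sketch of that quantization step, not the paper's actual model (shapes, codebook size, and the `quantize_motion` helper are illustrative assumptions):

```python
import numpy as np

def quantize_motion(features, codebook):
    """Map each frame's motion feature to its nearest codebook entry.

    features: (T, D) array of continuous pose features (hypothetical shape).
    codebook: (K, D) array of learned code vectors.
    Returns (T,) discrete token indices and the (T, D) quantized features.
    """
    # Squared Euclidean distance from every frame to every code vector
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d.argmin(axis=1)          # discrete motion tokens
    return tokens, codebook[tokens]    # quantized reconstruction

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))    # toy codebook: 8 codes, 4-dim features
motion = rng.normal(size=(16, 4))     # toy clip: 16 frames
tokens, quantized = quantize_motion(motion, codebook)
```

Once motions live in this discrete space, both directions (text2motion and motion2text) can be treated as sequence-to-sequence translation over token vocabularies.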

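The abstract's claim that autoregressive modeling over discrete tokens yields non-deterministic, variable-length motions comes down to sampling (rather than taking the argmax) from the decoder's next-token distribution until a stop token appears. A toy sketch under that assumption; `next_token_probs`, the vocabulary size, and the `END` token id are all hypothetical stand-ins for the paper's NMT-style decoder:

```python
import numpy as np

END = 0  # hypothetical end-of-motion token id

def sample_motion_tokens(next_token_probs, max_len=50, rng=None):
    """Autoregressively sample a variable-length motion-token sequence.

    next_token_probs(prefix) -> probability vector over the token vocabulary,
    standing in for a decoder conditioned on the input text. Sampling from it
    (instead of argmax) is what makes generation non-deterministic.
    """
    rng = rng or np.random.default_rng()
    seq = []
    for _ in range(max_len):
        p = next_token_probs(seq)
        tok = int(rng.choice(len(p), p=p))
        if tok == END:                 # stop token => variable length
            break
        seq.append(tok)
    return seq

# Toy decoder: fixed distribution over a 6-token vocabulary, small stop mass
toy = lambda prefix: np.array([0.1, 0.18, 0.18, 0.18, 0.18, 0.18])
motion_tokens = sample_motion_tokens(toy, rng=np.random.default_rng(1))
```

Running the sampler with different seeds yields different token sequences of different lengths for the same "text", which is the behavior the abstract attributes to the model.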
Results

Task | Dataset | Metric | Value | Model
Motion Synthesis | HumanML3D | Diversity | 8.589 | TM2T
Motion Synthesis | HumanML3D | FID | 1.501 | TM2T
Motion Synthesis | HumanML3D | Multimodality | 2.424 | TM2T
Motion Synthesis | HumanML3D | R Precision Top-3 | 0.729 | TM2T
Motion Synthesis | HumanML3D | Diversity | 6.409 | Text2Gesture
Motion Synthesis | HumanML3D | FID | 5.012 | Text2Gesture
Motion Synthesis | HumanML3D | R Precision Top-3 | 0.345 | Text2Gesture
Motion Synthesis | HumanML3D | Diversity | 7.676 | Language2Pose
Motion Synthesis | HumanML3D | FID | 11.02 | Language2Pose
Motion Synthesis | HumanML3D | R Precision Top-3 | 0.486 | Language2Pose
Motion Synthesis | KIT Motion-Language | Diversity | 9.473 | TM2T
Motion Synthesis | KIT Motion-Language | FID | 3.599 | TM2T
Motion Synthesis | KIT Motion-Language | Multimodality | 3.292 | TM2T
Motion Synthesis | KIT Motion-Language | R Precision Top-3 | 0.587 | TM2T
Motion Synthesis | KIT Motion-Language | Diversity | 9.073 | Language2Pose
Motion Synthesis | KIT Motion-Language | FID | 6.545 | Language2Pose
Motion Synthesis | KIT Motion-Language | R Precision Top-3 | 0.483 | Language2Pose
Motion Synthesis | KIT Motion-Language | Diversity | 9.334 | Text2Gesture
Motion Synthesis | KIT Motion-Language | FID | 12.12 | Text2Gesture
Motion Synthesis | KIT Motion-Language | R Precision Top-3 | 0.338 | Text2Gesture
Motion Captioning | HumanML3D | BERTScore | 37.8 | TM2T
Motion Captioning | HumanML3D | BLEU-4 | 22.3 | TM2T
Motion Captioning | KIT Motion-Language | BERTScore | 23 | TM2T
Motion Captioning | KIT Motion-Language | BLEU-4 | 18.4 | TM2T

Related Papers

Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation (2025-07-09)
Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings (2025-07-09)
GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation (2025-07-04)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation (2025-07-01)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval (2025-06-26)
Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn't Help with MT Evaluation (2025-06-25)