TEMOS: Generating diverse human motions from textual descriptions

Mathis Petrovich, Michael J. Black, Gül Varol

2022-04-25 | Motion Synthesis

Abstract

We address the problem of generating diverse 3D human motions from textual descriptions. This challenging task requires joint modeling of both modalities: understanding and extracting useful human-centric information from the text, and then generating plausible and realistic sequences of human poses. In contrast to most previous work, which focuses on generating a single, deterministic motion from a textual description, we design a variational approach that can produce multiple diverse human motions. We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space. We show the TEMOS framework can produce both skeleton-based animations, as in prior work, as well as more expressive SMPL body motions. We evaluate our approach on the KIT Motion-Language benchmark and, despite being relatively straightforward, demonstrate significant improvements over the state of the art. Code and models are available on our webpage.
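The variational idea in the abstract (a text encoder that outputs distribution parameters, from which several latent vectors are drawn and decoded) can be illustrated with a minimal sketch. This is not the TEMOS implementation: the linear "encoder" and "decoder", the random weights, and all dimensions and names below are hypothetical stand-ins chosen only to show how sampling from one text-conditioned Gaussian can yield multiple distinct motions.

```python
import numpy as np

# All sizes below are assumptions for illustration, not values from the paper.
TEXT_DIM = 16     # size of a pre-computed text feature vector
LATENT_DIM = 8    # size of the latent space
NUM_FRAMES = 4    # length of the generated motion
POSE_DIM = 6      # per-frame pose feature size


def toy_text_encoder(text_features, w_mu, w_logvar):
    """Hypothetical stand-in: map text features to Gaussian parameters."""
    return text_features @ w_mu, text_features @ w_logvar


def sample_latents(mu, logvar, num_samples, rng):
    """Reparameterized sampling: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal((num_samples, mu.shape[-1]))
    return mu + np.exp(0.5 * logvar) * eps


def toy_motion_decoder(z, w_dec):
    """Hypothetical stand-in: map each latent to a (frames, pose_dim) sequence."""
    return (z @ w_dec).reshape(z.shape[0], NUM_FRAMES, POSE_DIM)


rng = np.random.default_rng(0)
text_features = rng.standard_normal(TEXT_DIM)
w_mu = rng.standard_normal((TEXT_DIM, LATENT_DIM))
w_logvar = 0.1 * rng.standard_normal((TEXT_DIM, LATENT_DIM))
w_dec = rng.standard_normal((LATENT_DIM, NUM_FRAMES * POSE_DIM))

# One text description gives one distribution, from which several motions are drawn.
mu, logvar = toy_text_encoder(text_features, w_mu, w_logvar)
z = sample_latents(mu, logvar, num_samples=3, rng=rng)
motions = toy_motion_decoder(z, w_dec)
```

Each row of `z` decodes to a different sequence in `motions`, while decoding `mu` directly would give a single deterministic output for the same text.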

Results

Task	Dataset	Metric	Value	Model
Motion Synthesis	Inter-X	FID	29.258	TEMOS
Motion Synthesis	Inter-X	MMDist	6.867	TEMOS
Motion Synthesis	Inter-X	MModality	0.672	TEMOS
Motion Synthesis	Inter-X	R-Precision Top3	0.238	TEMOS
Motion Synthesis	InterHuman	FID	17.375	TEMOS
Motion Synthesis	InterHuman	MMDist	6.342	TEMOS
Motion Synthesis	InterHuman	MModality	0.535	TEMOS
Motion Synthesis	InterHuman	R-Precision Top3	0.45	TEMOS
