Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen

Published: 2023-01-15
Tasks: Motion Generation, Motion Synthesis
Links: Paper · PDF · Code (official)

Abstract

In this work, we investigate a simple and widely known conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and code reset) allows us to obtain high-quality discrete representations. For the GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT outperforms competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, currently the largest dataset, we achieve comparable consistency between text and generated motion (R-Precision), while our FID of 0.116 largely outperforms MotionDiffuse's 0.630. Additionally, our analyses on HumanML3D indicate that dataset size is a limitation of our approach. Our work suggests that VQ-VAE still remains a competitive approach for human motion generation.
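The first stage of the framework can be sketched as a vector quantizer trained with the two recipes the abstract names: exponential moving-average (EMA) codebook updates and code reset for under-used codes. The following is a minimal, self-contained sketch; the class name, reset threshold, and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class EMAQuantizer:
    """Minimal vector quantizer with EMA codebook updates and code reset.

    A hypothetical sketch of the VQ-VAE training recipes (EMA, code reset)
    credited in the abstract; hyperparameters are illustrative.
    """

    def __init__(self, num_codes=512, dim=4, decay=0.99,
                 reset_threshold=0.01, seed=0):
        self.rng = np.random.default_rng(seed)
        self.codebook = self.rng.normal(size=(num_codes, dim))
        self.decay = decay
        self.reset_threshold = reset_threshold
        # EMA accumulators: per-code usage count and summed assigned vectors.
        self.ema_count = np.ones(num_codes)
        self.ema_sum = self.codebook.copy()

    def encode(self, x):
        # Nearest-neighbour assignment: index of the closest code per vector.
        d = ((x[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def update(self, x):
        idx = self.encode(x)
        counts = np.bincount(idx, minlength=len(self.codebook)).astype(float)
        sums = np.zeros_like(self.codebook)
        np.add.at(sums, idx, x)  # scatter-add assigned vectors per code
        # EMA update of usage statistics, then recompute code vectors.
        self.ema_count = self.decay * self.ema_count + (1 - self.decay) * counts
        self.ema_sum = self.decay * self.ema_sum + (1 - self.decay) * sums
        self.codebook = self.ema_sum / self.ema_count[:, None]
        # Code reset: re-initialise rarely used codes to random encoder outputs,
        # keeping the whole codebook active.
        dead = np.where(self.ema_count < self.reset_threshold)[0]
        if len(dead) > 0:
            replace = x[self.rng.integers(0, len(x), size=len(dead))]
            self.codebook[dead] = replace
            self.ema_sum[dead] = replace
            self.ema_count[dead] = 1.0
        return idx
```

In the full pipeline, the resulting code indices become the discrete motion tokens that the second-stage GPT generates autoregressively from the text condition.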

Results

Task: Motion Synthesis (text-to-motion generation)

| Dataset | Model | FID | R-Precision Top-3 | Diversity | Multimodality |
|---|---|---|---|---|---|
| HumanML3D | T2M-GPT (τ = 0.5) | 0.116 | 0.775 | 9.761 | 1.856 |
| HumanML3D | T2M-GPT (τ = 0) | 0.140 | 0.685 | 9.844 | 3.285 |
| HumanML3D | T2M-GPT (τ ∈ U[0, 1]) | 0.141 | 0.775 | 9.722 | 1.831 |
| KIT Motion-Language | T2M-GPT (τ ∈ U[0, 1]) | 0.514 | 0.745 | 10.921 | 1.570 |
| KIT Motion-Language | T2M-GPT (τ = 0.5) | 0.717 | 0.737 | 10.862 | 1.912 |
| KIT Motion-Language | T2M-GPT (τ = 0) | 0.737 | 0.716 | 11.198 | 2.309 |

On Motion-X, which reports TMR-based retrieval metrics:

| Dataset | Model | FID | TMR-R-Precision Top-3 | TMR-Matching Score | Diversity | MModality |
|---|---|---|---|---|---|---|
| Motion-X | T2M-GPT | 1.366 | 0.655 | 0.881 | 10.753 | 2.356 |
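The τ values that distinguish the model variants refer to the corruption strategy mentioned in the abstract: during GPT training, each ground-truth motion-code token is replaced by a random token with probability τ, so the model learns to recover from its own sampling errors at test time (τ ∈ U[0, 1] draws a fresh rate per sequence). A minimal sketch under that interpretation; the function name and details are illustrative, not the authors' exact scheme.

```python
import random

def corrupt_tokens(tokens, tau, vocab_size, rng=None):
    """Replace each token with a random code index with probability tau.

    Hypothetical sketch of the training-time corruption strategy;
    tau = 0 leaves the sequence untouched.
    """
    rng = rng or random.Random(0)
    return [rng.randrange(vocab_size) if rng.random() < tau else t
            for t in tokens]

# Example: the U[0, 1] setting would draw tau per training sequence.
rng = random.Random(42)
tau = rng.uniform(0.0, 1.0)
noisy = corrupt_tokens([3, 1, 4, 1, 5], tau, vocab_size=512, rng=rng)
```

Corruption makes the training inputs resemble the imperfect prefixes the model conditions on at inference, which is the training-testing discrepancy the abstract refers to.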

Related Papers

- SnapMoGen: Human Motion Generation from Expressive Texts (2025-07-12)
- Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data (2025-07-09)
- Motion Generation: A Survey of Generative Approaches and Benchmarks (2025-07-07)
- DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
- A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation (2025-07-01)
- VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
- DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling (2025-06-23)
- PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis (2025-06-22)