Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts

Chuan Guo, Xinxin Zuo, Sen Wang, Li Cheng

2022-07-04 · Machine Translation · NMT · Motion Captioning · Motion Synthesis
Paper · PDF · Code (official)

Abstract

Inspired by the strong ties between vision and language, two intimate human sensing and communication modalities, our paper aims to explore the generation of 3D human full-body motions from texts, as well as its reciprocal task, shorthanded as text2motion and motion2text, respectively. To tackle the existing challenges, especially to enable the generation of multiple distinct motions from the same text and to avoid the undesirable production of trivial motionless pose sequences, we propose the use of motion tokens, a discrete and compact motion representation. This provides a level playing field for the two modalities, which are represented as motion tokens and text tokens, respectively. Moreover, our motion2text module is integrated into the inverse alignment process of our text2motion training pipeline, where a significant deviation of the synthesized text from the input text is penalized by a large training loss; empirically, this is shown to improve performance effectively. Finally, the mappings between the two modalities of motions and texts are facilitated by adapting a neural machine translation (NMT) model to our context. This autoregressive modeling of the distribution over discrete motion tokens further enables non-deterministic production of pose sequences of variable lengths from an input text. Our approach is flexible and can be used for both the text2motion and motion2text tasks. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach on both tasks over a variety of state-of-the-art methods. Project page: https://ericguo5513.github.io/TM2T/
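The "motion token" idea in the abstract amounts to quantizing continuous per-frame pose features against a learned codebook, so a motion clip becomes a short sequence of discrete indices. A minimal sketch of that quantization step, not the paper's actual model (shapes, codebook size, and the `quantize_motion` helper are illustrative assumptions):

```python
import numpy as np

def quantize_motion(features, codebook):
    """Map each frame's motion feature to its nearest codebook entry.

    features: (T, D) array of continuous pose features (hypothetical shape).
    codebook: (K, D) array of learned code vectors.
    Returns (T,) discrete token indices and the (T, D) quantized features.
    """
    # Squared Euclidean distance from every frame to every code vector
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d.argmin(axis=1)          # discrete motion tokens
    return tokens, codebook[tokens]    # quantized reconstruction

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))    # toy codebook: 8 codes, 4-dim features
motion = rng.normal(size=(16, 4))     # toy clip: 16 frames
tokens, quantized = quantize_motion(motion, codebook)
```

Once motions live in this discrete space, both directions (text2motion and motion2text) can be treated as sequence-to-sequence translation over token vocabularies.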

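The abstract's claim that autoregressive modeling over discrete tokens yields non-deterministic, variable-length motions comes down to sampling (rather than taking the argmax) from the decoder's next-token distribution until a stop token appears. A toy sketch under that assumption; `next_token_probs`, the vocabulary size, and the `END` token id are all hypothetical stand-ins for the paper's NMT-style decoder:

```python
import numpy as np

END = 0  # hypothetical end-of-motion token id

def sample_motion_tokens(next_token_probs, max_len=50, rng=None):
    """Autoregressively sample a variable-length motion-token sequence.

    next_token_probs(prefix) -> probability vector over the token vocabulary,
    standing in for a decoder conditioned on the input text. Sampling from it
    (instead of argmax) is what makes generation non-deterministic.
    """
    rng = rng or np.random.default_rng()
    seq = []
    for _ in range(max_len):
        p = next_token_probs(seq)
        tok = int(rng.choice(len(p), p=p))
        if tok == END:                 # stop token => variable length
            break
        seq.append(tok)
    return seq

# Toy decoder: fixed distribution over a 6-token vocabulary, small stop mass
toy = lambda prefix: np.array([0.1, 0.18, 0.18, 0.18, 0.18, 0.18])
motion_tokens = sample_motion_tokens(toy, rng=np.random.default_rng(1))
```

Running the sampler with different seeds yields different token sequences of different lengths for the same "text", which is the behavior the abstract attributes to the model.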
Results

Task | Dataset | Metric | Value | Model
Motion Synthesis | HumanML3D | Diversity | 8.589 | TM2T
Motion Synthesis | HumanML3D | FID | 1.501 | TM2T
Motion Synthesis | HumanML3D | Multimodality | 2.424 | TM2T
Motion Synthesis | HumanML3D | R Precision Top-3 | 0.729 | TM2T
Motion Synthesis | HumanML3D | Diversity | 6.409 | Text2Gesture
Motion Synthesis | HumanML3D | FID | 5.012 | Text2Gesture
Motion Synthesis | HumanML3D | R Precision Top-3 | 0.345 | Text2Gesture
Motion Synthesis | HumanML3D | Diversity | 7.676 | Language2Pose
Motion Synthesis | HumanML3D | FID | 11.02 | Language2Pose
Motion Synthesis | HumanML3D | R Precision Top-3 | 0.486 | Language2Pose
Motion Synthesis | KIT Motion-Language | Diversity | 9.473 | TM2T
Motion Synthesis | KIT Motion-Language | FID | 3.599 | TM2T
Motion Synthesis | KIT Motion-Language | Multimodality | 3.292 | TM2T
Motion Synthesis | KIT Motion-Language | R Precision Top-3 | 0.587 | TM2T
Motion Synthesis | KIT Motion-Language | Diversity | 9.073 | Language2Pose
Motion Synthesis | KIT Motion-Language | FID | 6.545 | Language2Pose
Motion Synthesis | KIT Motion-Language | R Precision Top-3 | 0.483 | Language2Pose
Motion Synthesis | KIT Motion-Language | Diversity | 9.334 | Text2Gesture
Motion Synthesis | KIT Motion-Language | FID | 12.12 | Text2Gesture
Motion Synthesis | KIT Motion-Language | R Precision Top-3 | 0.338 | Text2Gesture
Motion Captioning | HumanML3D | BERTScore | 37.8 | TM2T
Motion Captioning | HumanML3D | BLEU-4 | 22.3 | TM2T
Motion Captioning | KIT Motion-Language | BERTScore | 23 | TM2T
Motion Captioning | KIT Motion-Language | BLEU-4 | 18.4 | TM2T

Related Papers

Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation (2025-07-09)
Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings (2025-07-09)
GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation (2025-07-04)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation (2025-07-01)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval (2025-06-26)
Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn't Help with MT Evaluation (2025-06-25)