Papers With Code


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Chongyang Zhong, Lei Hu, Zihao Zhang, Shihong Xia

2023-09-02 · ICCV 2023 · Motion Generation · Motion Synthesis
Paper · PDF · Code (official)

Abstract

Generating 3D human motion from textual descriptions has been a research focus in recent years. The generated motion must be diverse, natural, and consistent with the text. Due to the complex spatio-temporal nature of human motion and the difficulty of learning the cross-modal relationship between text and motion, text-driven motion generation remains a challenging problem. To address these issues, we propose AttT2M, a two-stage method with a multi-perspective attention mechanism: body-part attention and global-local motion-text attention. The former concerns the motion-embedding perspective: a body-part spatio-temporal encoder is introduced into the VQ-VAE to learn a more expressive discrete latent space. The latter takes the cross-modal perspective and learns the sentence-level and word-level motion-text relationships. The text-driven motion is finally generated by a generative transformer. Extensive experiments on HumanML3D and KIT-ML demonstrate that our method outperforms current state-of-the-art works in both qualitative and quantitative evaluation, and achieves fine-grained synthesis and action2motion. Our code is available at https://github.com/ZcyMonkey/AttT2M
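The discrete latent space mentioned in the abstract comes from a VQ-VAE bottleneck: each frame's continuous encoding is snapped to its nearest vector in a learned codebook. A minimal NumPy sketch of that quantization step follows; the shapes, codebook size, and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def vq_quantize(z, codebook):
    """Nearest-neighbour vector quantization, as in a VQ-VAE bottleneck (sketch).

    z        : (T, D) array of per-frame motion encodings
    codebook : (K, D) array of code vectors
    Returns the quantized encodings (T, D) and the chosen code indices (T,).
    """
    # Squared Euclidean distance from every frame encoding to every code
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    idx = d.argmin(axis=1)                                     # (T,)
    return codebook[idx], idx

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))          # 16 frames, 8-dim encodings (illustrative)
codebook = rng.normal(size=(512, 8))  # 512 discrete codes (illustrative)
zq, idx = vq_quantize(z, codebook)
```

In training, the codebook is learned jointly with the encoder and decoder; the index sequence `idx` is what the second-stage generative transformer is trained to predict from text.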

Results

Task              Dataset               Metric             Value   Model
Motion Synthesis  HumanML3D             Diversity          9.7     AttT2M
Motion Synthesis  HumanML3D             FID                0.112   AttT2M
Motion Synthesis  HumanML3D             Multimodality      2.452   AttT2M
Motion Synthesis  HumanML3D             R-Precision Top-3  0.786   AttT2M
Motion Synthesis  KIT Motion-Language   Diversity          10.96   AttT2M
Motion Synthesis  KIT Motion-Language   FID                0.87    AttT2M
Motion Synthesis  KIT Motion-Language   Multimodality      2.281   AttT2M
Motion Synthesis  KIT Motion-Language   R-Precision Top-3  0.751   AttT2M
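The R-Precision Top-3 metric above is a retrieval score: given a text embedding and a pool of candidate motions, the ground-truth motion counts as retrieved if it ranks among the three nearest candidates. A minimal NumPy sketch follows, assuming the common HumanML3D protocol of 32-candidate pools; the function name and embedding sizes are illustrative.

```python
import numpy as np

def r_precision_top3(text_emb, motion_embs, gt_index):
    """Top-3 R-Precision for one query (sketch of the 32-candidate protocol).

    Returns 1.0 if the ground-truth motion is among the 3 candidates
    closest to the text embedding (Euclidean distance), else 0.0.
    """
    d = np.linalg.norm(motion_embs - text_emb, axis=1)  # (32,) distances
    top3 = np.argsort(d)[:3]                            # 3 nearest candidates
    return float(gt_index in top3)

rng = np.random.default_rng(1)
motions = rng.normal(size=(32, 64))             # 1 ground truth + 31 distractors
text = motions[0] + 0.01 * rng.normal(size=64)  # text embedding near its motion
hit = r_precision_top3(text, motions, gt_index=0)  # 1.0: ground truth retrieved
```

Reported numbers are this hit rate averaged over many query batches; Diversity, FID, and Multimodality are instead computed from the distribution of motion embeddings.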

Related Papers

SnapMoGen: Human Motion Generation from Expressive Texts (2025-07-12)
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data (2025-07-09)
Motion Generation: A Survey of Generative Approaches and Benchmarks (2025-07-07)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation (2025-07-01)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling (2025-06-23)
PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis (2025-06-22)