
Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, Sonal Gupta

2023-05-16 · ICCV 2023
Tasks: Text-to-Video Generation, Motion Generation, Motion Synthesis, Video Generation

Abstract

Text-guided human motion generation has drawn significant interest because of its impactful applications spanning animation and robotics. Recently, application of diffusion models for motion generation has enabled improvements in the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts. In this paper, we introduce Make-An-Animation, a text-conditioned human motion generation model which learns more diverse poses and prompts from large-scale image-text datasets, enabling significant improvement in performance over prior works. Make-An-Animation is trained in two stages. First, we train on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. Second, we fine-tune on motion capture data, adding additional layers to model the temporal dimension. Unlike prior diffusion models for motion generation, Make-An-Animation uses a U-Net architecture similar to recent text-to-video generation models. Human evaluation of motion realism and alignment with input text shows that our model reaches state-of-the-art performance on text-to-motion generation.
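The two-stage recipe described in the abstract can be sketched as follows. A key point is that a static pseudo-pose from image-text data is just a one-frame sequence, so the same diffusion forward-noising step applies in both stages, with stage 2 adding the temporal axis. This is a minimal illustration of that idea; the 24-joint pose layout, sequence length, timestep count, and linear beta schedule are assumptions for the sketch, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, t, T=1000):
    # Standard DDPM forward process q(x_t | x_0) with a linear beta schedule
    # (illustrative schedule; the paper's exact settings may differ).
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise

# Stage 1: static pseudo-poses extracted from image-text data -> length-1 sequences.
static_pose = rng.standard_normal((1, 24, 3))   # (frames, joints, xyz) -- assumed layout
# Stage 2: motion-capture clips -> the full temporal dimension.
motion_clip = rng.standard_normal((60, 24, 3))

xt_static, eps_s = add_noise(static_pose, t=500)
xt_motion, eps_m = add_noise(motion_clip, t=500)
print(xt_static.shape, xt_motion.shape)
```

Because the per-frame treatment is identical, the spatial layers trained in stage 1 carry over unchanged, and only the newly added temporal layers must learn motion dynamics in stage 2.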

Results

Task              Dataset    Metric               Value  Model
Motion Synthesis  HumanML3D  Diversity            8.23   MAA
Motion Synthesis  HumanML3D  FID                  0.774  MAA
Motion Synthesis  HumanML3D  R-Precision (Top-3)  0.676  MAA
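For reference, the HumanML3D metrics above are typically computed on paired motion and text embeddings: R-Precision ranks each motion's true caption against a pool of mismatched captions, and Diversity averages distances between randomly sampled generated motions. A minimal sketch with toy random embeddings, assuming a 32-candidate pool (the real benchmark uses a pretrained contrastive motion/text encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

def r_precision_top3(motion_emb, text_emb, pool=32):
    # For each motion, rank its true caption against (pool - 1) mismatched
    # captions by Euclidean distance; count a hit if the truth is in the top 3.
    n = len(motion_emb)
    hits = 0
    for i in range(n):
        distractors = rng.choice([j for j in range(n) if j != i], pool - 1, replace=False)
        cands = np.concatenate([[i], distractors])
        d = np.linalg.norm(text_emb[cands] - motion_emb[i], axis=1)
        if np.argsort(d).tolist().index(0) < 3:  # rank of the true caption
            hits += 1
    return hits / n

def diversity(motion_emb, pairs=100):
    # Mean distance between randomly sampled pairs of motion embeddings.
    a = rng.integers(0, len(motion_emb), pairs)
    b = rng.integers(0, len(motion_emb), pairs)
    return np.linalg.norm(motion_emb[a] - motion_emb[b], axis=1).mean()

# Toy demo: motion embeddings well aligned with their captions.
n, dim = 64, 16
text_emb = rng.standard_normal((n, dim))
motion_emb = text_emb + 0.05 * rng.standard_normal((n, dim))
rp = r_precision_top3(motion_emb, text_emb)
div = diversity(motion_emb)
print(f"R-Precision (Top-3): {rp:.2f}, Diversity: {div:.2f}")
```

Higher R-Precision means better text-motion alignment; Diversity is compared against that of real motions rather than simply maximized.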

Related Papers

LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
SnapMoGen: Human Motion Generation from Expressive Texts (2025-07-12)
$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)
Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective (2025-07-11)
Scaling RL to Long Videos (2025-07-10)