Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis

Esteve Valls Mascaro, Hyemin Ahn, Dongheui Lee

2023-08-14 · Human Pose Forecasting · Motion Synthesis

Abstract

The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model obtains performance comparable to or better than the state of the art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset. Moreover, it achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly in long transition periods. More information can be found on the project website: https://evm7.github.io/UNIMASKM-page/
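The abstract's core recipe can be sketched in a few lines: group a pose sequence into body-part "patches" (ViT-style tokens), then express each task (forecasting, inbetweening) as a different masking pattern over those tokens, with the mask flags passed to the model alongside the visible tokens. This is a minimal illustrative sketch, not the paper's implementation; the toy skeleton size, the joint-to-part grouping, and the mask ratios are all assumptions.

```python
import numpy as np

# Toy sizes, not the paper's configuration: 8 frames, 5 body parts of
# 3 joints each (stand-ins for torso, arms, legs), 3D coordinates.
T, NUM_PARTS, JOINTS_PER_PART, C = 8, 5, 3, 3
motion = np.random.randn(T, NUM_PARTS * JOINTS_PER_PART, C)

def patchify(seq):
    """(T, J, C) motion -> one flat token per (frame, body part)."""
    t, j, c = seq.shape
    parts = seq.reshape(t, NUM_PARTS, JOINTS_PER_PART, c)
    return parts.reshape(t * NUM_PARTS, JOINTS_PER_PART * c)

def task_mask(task):
    """Boolean mask over tokens: True = masked (to be reconstructed)."""
    mask = np.zeros((T, NUM_PARTS), dtype=bool)
    if task == "forecast":      # future frames are unknown
        mask[T // 2:] = True
    elif task == "inbetween":   # key-poses at both ends, middle unknown
        mask[1:-1] = True
    return mask.reshape(-1)

tokens = patchify(motion)          # (40, 9): 8 frames x 5 part tokens
mask = task_mask("forecast")       # 20 of 40 tokens masked
visible = tokens.copy()
visible[mask] = 0.0                # masked tokens zeroed out; the mask
                                   # itself is also fed to the model
print(tokens.shape, int(mask.sum()))  # (40, 9) 20
```

Under this framing, switching from forecasting to inbetweening changes only the mask pattern, not the architecture, which is what makes the model task-independent.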

Results

Task | Dataset | Metric | Value | Model
Pose Estimation | Human3.6M | Average MPJPE (mm) @ 1000 ms | 112.1 | UNIMASK-M
Pose Estimation | Human3.6M | Average MPJPE (mm) @ 400 ms | 61.6 | UNIMASK-M
3D | Human3.6M | Average MPJPE (mm) @ 1000 ms | 112.1 | UNIMASK-M
3D | Human3.6M | Average MPJPE (mm) @ 400 ms | 61.6 | UNIMASK-M
1 Image, 2*2 Stitchi | Human3.6M | Average MPJPE (mm) @ 1000 ms | 112.1 | UNIMASK-M
1 Image, 2*2 Stitchi | Human3.6M | Average MPJPE (mm) @ 400 ms | 61.6 | UNIMASK-M
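The metric reported above, MPJPE (Mean Per-Joint Position Error), is the standard pose-forecasting error: the Euclidean distance between predicted and ground-truth joint positions, averaged over joints (and frames), in millimeters. A minimal sketch with made-up toy poses:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth joints, in the inputs' units (here mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: 17 joints, each predicted 3 mm off in x and 4 mm off in y,
# so every joint's error is sqrt(3^2 + 4^2) = 5 mm.
pred = np.zeros((17, 3))
gt = np.full((17, 3), [3.0, 4.0, 0.0])
print(mpjpe(pred, gt))  # 5.0
```

The "@ 400 ms" / "@ 1000 ms" qualifiers in the table mean the error is measured at that prediction horizon into the future.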

Related Papers

DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling (2025-06-23)
PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis (2025-06-22)
Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation (2025-06-12)
DanceChat: Large Language Model-Guided Music-to-Dance Generation (2025-06-12)
MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation (2025-06-03)
MotionPro: A Precise Motion Controller for Image-to-Video Generation (2025-05-26)