Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


InterMask: 3D Human Interaction Generation via Collaborative Masked Modelling

Muhammad Gohar Javed, Chuan Guo, Li Cheng, Xingyu Li

2024-10-13 · Motion Synthesis
Paper · PDF · Code (official)

Abstract

Generating realistic 3D human-human interactions from textual descriptions remains a challenging task. Existing approaches, typically based on diffusion models, often generate unnatural and unrealistic results. In this work, we introduce InterMask, a novel framework for generating human interactions using collaborative masked modeling in discrete space. InterMask first employs a VQ-VAE to transform each motion sequence into a 2D discrete motion token map. Unlike traditional 1D VQ token maps, it better preserves fine-grained spatio-temporal details and promotes spatial awareness within each token. Building on this representation, InterMask utilizes a generative masked modeling framework to collaboratively model the tokens of two interacting individuals. This is achieved by employing a transformer architecture specifically designed to capture complex spatio-temporal interdependencies. During training, it randomly masks the motion tokens of both individuals and learns to predict them. In inference, starting from fully masked sequences, it progressively fills in the tokens for both individuals. With its enhanced motion representation, dedicated architecture, and effective learning strategy, InterMask achieves state-of-the-art results, producing high-fidelity and diverse human interactions. It outperforms previous methods, achieving an FID of $5.154$ (vs $5.535$ for in2IN) on the InterHuman dataset and $0.399$ (vs $5.207$ for InterGen) on the InterX dataset. Additionally, InterMask seamlessly supports reaction generation without the need for model redesign or fine-tuning.
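The inference procedure described in the abstract — start from fully masked token maps for both individuals and progressively fill in tokens — is the standard iterative mask-predict loop of generative masked modeling. The following is a minimal, hypothetical sketch of that loop, not the authors' implementation: the transformer is replaced by a random stand-in (`predict`), and all names, shapes, and the cosine unmasking schedule are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of iterative mask-predict decoding for two interacting
# persons. Each person has a 2D token map of shape (time, joints), mirroring
# the paper's 2D discrete motion token maps. All values are illustrative.
MASK = -1
VOCAB, T, J, STEPS = 512, 8, 4, 4
rng = np.random.default_rng(0)

def predict(tokens_self, tokens_other):
    """Stand-in for the interaction transformer: per-token logits for one
    person, conditioned on both token maps (here: random logits)."""
    return rng.random((T, J, VOCAB))

def mask_predict(steps=STEPS):
    a = np.full((T, J), MASK)
    b = np.full((T, J), MASK)
    for s in range(steps):
        for tok, other in ((a, b), (b, a)):
            logits = predict(tok, other)
            conf = logits.max(axis=-1)
            pred = logits.argmax(axis=-1)
            # Cosine schedule (an assumption): the number of tokens left
            # masked shrinks toward zero as steps progress.
            keep_masked = int(T * J * np.cos((s + 1) / steps * np.pi / 2))
            masked = tok == MASK
            conf = np.where(masked, conf, -np.inf)  # only fill masked slots
            order = np.argsort(conf, axis=None)[::-1]  # most confident first
            n_fill = int(masked.sum()) - keep_masked
            for idx in order[:max(n_fill, 0)]:
                i, j = divmod(idx, J)
                tok[i, j] = pred[i, j]
    # Final pass: fill any tokens still masked after the schedule.
    for tok, other in ((a, b), (b, a)):
        pred = predict(tok, other).argmax(axis=-1)
        tok[tok == MASK] = pred[tok == MASK]
    return a, b

a, b = mask_predict()
assert (a != MASK).all() and (b != MASK).all()
```

Alternating the prediction between the two persons each round is one way to realize the "collaborative" modeling of both token maps; the decoded token maps would then be passed through the VQ-VAE decoder to recover continuous motion.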

Results

Task               Dataset      Metric             Value   Model
Motion Synthesis   Inter-X      FID                0.399   InterMask
Motion Synthesis   Inter-X      MMDist             3.705   InterMask
Motion Synthesis   Inter-X      MModality          2.261   InterMask
Motion Synthesis   Inter-X      R-Precision Top3   0.705   InterMask
Motion Synthesis   InterHuman   FID                5.154   InterMask
Motion Synthesis   InterHuman   MMDist             3.79    InterMask
Motion Synthesis   InterHuman   MModality          1.737   InterMask
Motion Synthesis   InterHuman   R-Precision Top3   0.683   InterMask

Related Papers

DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling (2025-06-23)
PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis (2025-06-22)
Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation (2025-06-12)
DanceChat: Large Language Model-Guided Music-to-Dance Generation (2025-06-12)
MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation (2025-06-03)
MotionPro: A Precise Motion Controller for Image-to-Video Generation (2025-05-26)