BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation

S. Rohollah Hosseyni, Ali Ahmad Rahmani, S. Jamal Seyedmohammadi, Sanaz Seyedin, Arash Mohammadi

2024-09-17

Human motion prediction, Motion Forecasting, Motion Generation, Motion Synthesis

Abstract

Autoregressive models excel at modeling sequential dependencies by enforcing causal constraints, yet they struggle to capture complex bidirectional patterns due to their unidirectional nature. In contrast, mask-based models leverage bidirectional context, enabling richer dependency modeling. However, they often assume token independence during prediction, which undermines the modeling of sequential dependencies. Additionally, the corruption of sequences through masking or absorption can introduce unnatural distortions, complicating the learning process. To address these issues, we propose Bidirectional Autoregressive Diffusion (BAD), a novel approach that unifies the strengths of autoregressive and mask-based generative models. BAD utilizes a permutation-based corruption technique that preserves the natural sequence structure while enforcing causal dependencies through randomized ordering, enabling the effective capture of both sequential and bidirectional relationships. Comprehensive experiments show that BAD outperforms autoregressive and mask-based models in text-to-motion generation, suggesting a novel pre-training strategy for sequence modeling. The codebase for BAD is available at https://github.com/RohollahHS/BAD.
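
The idea of enforcing causal dependencies through a randomized ordering can be illustrated with a small sketch. It is not taken from the BAD codebase. It assumes that each token may attend only to tokens that come no later than itself in a randomly sampled generation order, similar in spirit to permutation language modeling. The function name `permuted_causal_mask` and its parameters are hypothetical.

```python
import random


def permuted_causal_mask(seq_len, order=None, seed=None):
    """Build an attention mask from a generation order (illustrative assumption).

    order[k] is the sequence position generated at step k. If no order is
    given, a random permutation is sampled. mask[i][j] is True when position i
    is allowed to attend to position j, i.e. j is generated no later than i.
    """
    if order is None:
        rng = random.Random(seed)
        order = list(range(seq_len))
        rng.shuffle(order)

    # rank[pos] = step at which this position is generated
    rank = [0] * seq_len
    for step, pos in enumerate(order):
        rank[pos] = step

    mask = [[rank[j] <= rank[i] for j in range(seq_len)] for i in range(seq_len)]
    return order, mask


if __name__ == "__main__":
    order, mask = permuted_causal_mask(5, seed=0)
    print("generation order:", order)
    for row in mask:
        print("".join("1" if allowed else "0" for allowed in row))
```

With the identity order `[0, 1, ..., n-1]` the mask reduces to the usual lower-triangular causal mask. Other permutations let a position see tokens on both its left and right, while the tokens themselves stay at their natural positions.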

Results

Task	Dataset	Metric	Value	Model
Pose Tracking	HumanML3D	Diversity	9.688	BAD (CBS)
Pose Tracking	HumanML3D	FID	0.049	BAD (CBS)
Pose Tracking	HumanML3D	Multimodality	1.119	BAD (CBS)
Pose Tracking	HumanML3D	R Precision Top3	0.8	BAD (CBS)
Pose Tracking	HumanML3D	Diversity	9.694	BAD (OAAS)
Pose Tracking	HumanML3D	FID	0.065	BAD (OAAS)
Pose Tracking	HumanML3D	Multimodality	1.194	BAD (OAAS)
Pose Tracking	HumanML3D	R Precision Top3	0.808	BAD (OAAS)
Pose Tracking	KIT Motion-Language	Diversity	11	BAD (OAAS)
Pose Tracking	KIT Motion-Language	FID	0.221	BAD (OAAS)
Pose Tracking	KIT Motion-Language	Multimodality	1.17	BAD (OAAS)
Pose Tracking	KIT Motion-Language	R Precision Top3	0.75	BAD (OAAS)
Motion Synthesis	HumanML3D	Diversity	9.688	BAD (CBS)
Motion Synthesis	HumanML3D	FID	0.049	BAD (CBS)
Motion Synthesis	HumanML3D	Multimodality	1.119	BAD (CBS)
Motion Synthesis	HumanML3D	R Precision Top3	0.8	BAD (CBS)
Motion Synthesis	HumanML3D	Diversity	9.694	BAD (OAAS)
Motion Synthesis	HumanML3D	FID	0.065	BAD (OAAS)
Motion Synthesis	HumanML3D	Multimodality	1.194	BAD (OAAS)
Motion Synthesis	HumanML3D	R Precision Top3	0.808	BAD (OAAS)
Motion Synthesis	KIT Motion-Language	Diversity	11	BAD (OAAS)
Motion Synthesis	KIT Motion-Language	FID	0.221	BAD (OAAS)
Motion Synthesis	KIT Motion-Language	Multimodality	1.17	BAD (OAAS)
Motion Synthesis	KIT Motion-Language	R Precision Top3	0.75	BAD (OAAS)
3D Human Pose Tracking	HumanML3D	Diversity	9.688	BAD (CBS)
3D Human Pose Tracking	HumanML3D	FID	0.049	BAD (CBS)
3D Human Pose Tracking	HumanML3D	Multimodality	1.119	BAD (CBS)
3D Human Pose Tracking	HumanML3D	R Precision Top3	0.8	BAD (CBS)
3D Human Pose Tracking	HumanML3D	Diversity	9.694	BAD (OAAS)
3D Human Pose Tracking	HumanML3D	FID	0.065	BAD (OAAS)
3D Human Pose Tracking	HumanML3D	Multimodality	1.194	BAD (OAAS)
3D Human Pose Tracking	HumanML3D	R Precision Top3	0.808	BAD (OAAS)
3D Human Pose Tracking	KIT Motion-Language	Diversity	11	BAD (OAAS)
3D Human Pose Tracking	KIT Motion-Language	FID	0.221	BAD (OAAS)
3D Human Pose Tracking	KIT Motion-Language	Multimodality	1.17	BAD (OAAS)
3D Human Pose Tracking	KIT Motion-Language	R Precision Top3	0.75	BAD (OAAS)
