S. Rohollah Hosseyni, Ali Ahmad Rahmani, S. Jamal Seyedmohammadi, Sanaz Seyedin, Arash Mohammadi
Autoregressive models excel at modeling sequential dependencies by enforcing causal constraints, yet they struggle to capture complex bidirectional patterns due to their unidirectional nature. In contrast, mask-based models leverage bidirectional context, enabling richer dependency modeling. However, they often assume token independence during prediction, which undermines the modeling of sequential dependencies. Additionally, the corruption of sequences through masking or absorption can introduce unnatural distortions, complicating the learning process. To address these issues, we propose Bidirectional Autoregressive Diffusion (BAD), a novel approach that unifies the strengths of autoregressive and mask-based generative models. BAD utilizes a permutation-based corruption technique that preserves the natural sequence structure while enforcing causal dependencies through randomized ordering, enabling the effective capture of both sequential and bidirectional relationships. Comprehensive experiments show that BAD outperforms autoregressive and mask-based models in text-to-motion generation, suggesting a novel pre-training strategy for sequence modeling. The codebase for BAD is available at https://github.com/RohollahHS/BAD.
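To make the idea of "causal dependencies through randomized ordering" concrete, the following is a minimal sketch of one common way to build an attention mask from a random ordering: a position may attend to itself and to every position that comes earlier in the sampled order, wherever those positions sit in the original sequence. This is an illustration under our own assumptions, not the BAD implementation; the function names `sample_order` and `causal_mask_for_order` are hypothetical, and the abstract does not specify how the actual masking is constructed.

```python
import random


def sample_order(seq_len, rng):
    """Sample a random generation order: order[k] is the position visited at step k."""
    order = list(range(seq_len))
    rng.shuffle(order)
    return order


def causal_mask_for_order(order):
    """Build a boolean mask where mask[i][j] is True if position i may attend to position j.

    Assumption for this sketch: position i sees position j exactly when j is
    visited no later than i in the given order.
    """
    seq_len = len(order)
    rank = [0] * seq_len
    for step, pos in enumerate(order):
        rank[pos] = step  # rank[pos] = step at which this position is visited
    return [[rank[j] <= rank[i] for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    order = sample_order(6, random.Random(0))
    for row in causal_mask_for_order(order):
        print("".join("1" if allowed else "0" for allowed in row))
```

With the identity ordering this reduces to the usual lower-triangular mask of a left-to-right autoregressive model. Under any other ordering, some positions can see tokens on both their left and their right while the dependency structure stays causal with respect to the sampled order.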
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Motion Synthesis | HumanML3D | Diversity | 9.688 | BAD (CBS) |
| Motion Synthesis | HumanML3D | FID | 0.049 | BAD (CBS) |
| Motion Synthesis | HumanML3D | Multimodality | 1.119 | BAD (CBS) |
| Motion Synthesis | HumanML3D | R Precision Top3 | 0.8 | BAD (CBS) |
| Motion Synthesis | HumanML3D | Diversity | 9.694 | BAD (OAAS) |
| Motion Synthesis | HumanML3D | FID | 0.065 | BAD (OAAS) |
| Motion Synthesis | HumanML3D | Multimodality | 1.194 | BAD (OAAS) |
| Motion Synthesis | HumanML3D | R Precision Top3 | 0.808 | BAD (OAAS) |
| Motion Synthesis | KIT Motion-Language | Diversity | 11 | BAD (OAAS) |
| Motion Synthesis | KIT Motion-Language | FID | 0.221 | BAD (OAAS) |
| Motion Synthesis | KIT Motion-Language | Multimodality | 1.17 | BAD (OAAS) |
| Motion Synthesis | KIT Motion-Language | R Precision Top3 | 0.75 | BAD (OAAS) |