Saeed Saadatnejad, Ali Rasekh, Mohammadreza Mofayezi, Yasamin Medghalchi, Sara Rajabzadeh, Taylor Mordan, Alexandre Alahi
Predicting 3D human poses in real-world scenarios, also known as human pose forecasting, is inevitably subject to noisy inputs arising from inaccurate 3D pose estimations and occlusions. To address these challenges, we propose a diffusion-based approach that can make predictions from noisy observations. We frame the prediction task as a denoising problem, in which observation and prediction are treated as a single sequence containing missing elements (whether in the observation or the prediction horizon). All missing elements are treated as noise and denoised with our conditional diffusion model. To better handle long-term forecasting horizons, we present a temporal cascaded diffusion model. We demonstrate the benefits of our approach on four publicly available datasets (Human3.6M, HumanEva-I, AMASS, and 3DPW), outperforming the state of the art. Additionally, we show that our framework is generic enough to improve any 3D pose prediction model, serving as a pre-processing step to repair its inputs and a post-processing step to refine its outputs. The code is available online at https://github.com/vita-epfl/DePOSit
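
The framing of observation and prediction as one sequence with missing elements can be illustrated with a minimal sketch. It is not taken from the released code: the function name `build_masked_sequence`, the flattened per-frame layout, and the choice of standard Gaussian noise for missing entries are all assumptions made for illustration. The sketch only prepares the input and mask that a conditional denoiser would consume; it does not implement the diffusion model itself.

```python
import random


def build_masked_sequence(observed, horizon, missing_obs, seed=0):
    """Concatenate observation and prediction frames into one sequence.

    observed:    list of frames, each a list of floats (flattened joint coordinates).
    horizon:     number of future frames to predict.
    missing_obs: set of (frame_index, coordinate_index) pairs that are missing
                 from the observation (e.g. occluded joints).
    Returns (sequence, mask), where mask[t][d] is 1 for a known value and 0 for
    a missing one. Missing values are filled with Gaussian noise (an assumption).
    """
    rng = random.Random(seed)
    dim = len(observed[0])
    sequence, mask = [], []

    # Observation part: keep known values, replace missing ones with noise.
    for t, frame in enumerate(observed):
        seq_row, mask_row = [], []
        for d, value in enumerate(frame):
            if (t, d) in missing_obs:
                seq_row.append(rng.gauss(0.0, 1.0))
                mask_row.append(0)
            else:
                seq_row.append(value)
                mask_row.append(1)
        sequence.append(seq_row)
        mask.append(mask_row)

    # Prediction part: every element is missing, so it starts as pure noise.
    for _ in range(horizon):
        sequence.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
        mask.append([0] * dim)

    return sequence, mask


if __name__ == "__main__":
    obs = [[1.0, 2.0], [3.0, 4.0]]
    seq, mask = build_masked_sequence(obs, horizon=3, missing_obs={(1, 0)})
    print(mask)
```

In this sketch, an occluded coordinate in the observation and a future frame are handled identically: both are masked out and initialized with noise, which mirrors the single-sequence view described in the abstract.
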
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Pose Estimation | AMASS | FDE@1000ms (mm) | 66.7 | TCD |
| Pose Estimation | AMASS | FDE@560ms (mm) | 49.8 | TCD |
| Pose Estimation | AMASS | FDE@720ms (mm) | 54.5 | TCD |
| Pose Estimation | AMASS | FDE@880ms (mm) | 60.1 | TCD |
| Pose Estimation | Human3.6M | ADE | 356 | TCD |
| Pose Estimation | Human3.6M | APD | 19466 | TCD |
| Pose Estimation | Human3.6M | FDE | 396 | TCD |
| Pose Estimation | Human3.6M | MMADE | 463 | TCD |
| Pose Estimation | Human3.6M | MMFDE | 445 | TCD |
| Pose Estimation | HumanEva-I | ADE@2000ms | 199 | TCD |
| Pose Estimation | HumanEva-I | APD@2000ms | 6764 | TCD |
| Pose Estimation | HumanEva-I | FDE@2000ms | 215 | TCD |
| Pose Estimation | 3DPW | FDE@1000ms (mm) | 73.4 | TCD |
| Pose Estimation | 3DPW | FDE@560ms (mm) | 55.4 | TCD |
| Pose Estimation | 3DPW | FDE@720ms (mm) | 61.6 | TCD |
| Pose Estimation | 3DPW | FDE@880ms (mm) | 67.9 | TCD |
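
For readers unfamiliar with the ADE and FDE columns, the following is a minimal sketch of how displacement errors of this kind are commonly computed for a single predicted sequence. The function names and the per-joint averaging are assumptions for illustration; the exact benchmark protocols behind the numbers in the table may differ (for example, in how multiple samples are aggregated), and the sketch does not cover APD, MMADE, or MMFDE.

```python
import math


def frame_error(pred_frame, target_frame):
    """Mean Euclidean distance over joints for one frame.

    Each frame is a list of (x, y, z) joint positions.
    """
    distances = [math.dist(p, t) for p, t in zip(pred_frame, target_frame)]
    return sum(distances) / len(distances)


def ade(pred, target):
    """Average displacement error: mean of the per-frame errors over all frames."""
    errors = [frame_error(p, t) for p, t in zip(pred, target)]
    return sum(errors) / len(errors)


def fde(pred, target):
    """Final displacement error: the per-frame error at the last predicted frame."""
    return frame_error(pred[-1], target[-1])


if __name__ == "__main__":
    # Two frames with a single joint each (hypothetical values).
    pred = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
    target = [[(3.0, 4.0, 0.0)], [(0.0, 0.0, 1.0)]]
    print(ade(pred, target), fde(pred, target))
```

Under this reading, an entry such as FDE@1000ms is the final-frame error when the prediction horizon ends 1000 ms after the last observed frame.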