Yu Sun, Qian Bao, Wu Liu, Yili Fu, Michael J. Black, Tao Mei
This paper focuses on regressing multiple 3D people from a single RGB image. Existing approaches predominantly follow a multi-stage pipeline that first detects people in bounding boxes and then independently regresses their 3D body meshes. In contrast, we propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP). The approach is conceptually simple, bounding-box-free, and able to learn a per-pixel representation in an end-to-end manner. Our method simultaneously predicts a Body Center heatmap and a Mesh Parameter map, which jointly describe the 3D body mesh at the pixel level. Through a body-center-guided sampling process, the body mesh parameters of all people in the image are easily extracted from the Mesh Parameter map. Equipped with such a fine-grained representation, our one-stage framework avoids the complex multi-stage process and is more robust to occlusion. Compared with state-of-the-art methods, ROMP achieves superior performance on challenging multi-person benchmarks, including 3DPW and CMU Panoptic. Experiments on crowded/occluded datasets demonstrate its robustness under various types of occlusion. The released code is the first real-time implementation of monocular multi-person 3D mesh regression.
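
The body-center-guided sampling described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released ROMP code: the function name `sample_mesh_params`, the 3x3 local-maximum test, the `threshold` value, and the array layouts (`(H, W)` for the heatmap, `(C, H, W)` for the parameter map) are all assumptions made for illustration. The idea shown is only that peaks in the Body Center heatmap select which per-pixel vectors to read from the Mesh Parameter map.

```python
import numpy as np

def sample_mesh_params(center_heatmap, param_map, threshold=0.25):
    """Read one parameter vector per detected body center.

    center_heatmap: float array of shape (H, W), higher means "more likely a center".
    param_map:      array of shape (C, H, W), one C-dim vector per pixel.
    threshold:      hypothetical confidence cut-off (an assumption, not from the paper).
    Returns (centers, params): (N, 2) array of (row, col) and (N, C) array.
    """
    h, w = center_heatmap.shape
    # Pad with -inf so border pixels can be compared with a full 3x3 window.
    padded = np.pad(center_heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views and take the per-pixel neighbourhood maximum.
    neighbourhood = np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    )
    local_max = neighbourhood.max(axis=0)
    # A pixel is kept if it equals its neighbourhood maximum and is confident enough.
    is_peak = (center_heatmap >= local_max) & (center_heatmap > threshold)
    ys, xs = np.nonzero(is_peak)
    centers = np.stack([ys, xs], axis=1)
    # Index the parameter map at the peak locations: (C, N) transposed to (N, C).
    params = param_map[:, ys, xs].T
    return centers, params

# Toy usage: two people on an 8x8 grid, with a 3-channel parameter map.
heatmap = np.zeros((8, 8))
heatmap[2, 3] = 0.9   # first body center
heatmap[2, 4] = 0.5   # weaker neighbour, suppressed by the peak next to it
heatmap[6, 5] = 0.8   # second body center
param_map = np.arange(3 * 8 * 8).reshape(3, 8, 8)
centers, params = sample_mesh_params(heatmap, param_map)
print(centers)
print(params)
```

In this toy input the weaker response at `(2, 4)` is discarded because a stronger value sits in its 3x3 window, so exactly two parameter vectors are returned. Equal-valued plateaus would yield several peaks in this sketch, which a real implementation would need to handle.
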
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | Relative Human | PCDR | 54.84 | ROMP |
| Depth Estimation | Relative Human | PCDR-Adult | 55.34 | ROMP |
| Depth Estimation | Relative Human | PCDR-Baby | 30.08 | ROMP |
| Depth Estimation | Relative Human | PCDR-Kid | 48.41 | ROMP |
| Depth Estimation | Relative Human | PCDR-Teen | 51.12 | ROMP |
| Depth Estimation | Relative Human | mPCDK | 0.866 | ROMP |
| 3D Human Pose Estimation | EMDB | Average MPJAE (deg) | 26.5975 | ROMP |
| 3D Human Pose Estimation | EMDB | Average MPJAE-PA (deg) | 23.9901 | ROMP |
| 3D Human Pose Estimation | EMDB | Average MPJPE (mm) | 112.652 | ROMP |
| 3D Human Pose Estimation | EMDB | Average MPJPE-PA (mm) | 75.1869 | ROMP |
| 3D Human Pose Estimation | EMDB | Average MVE (mm) | 134.863 | ROMP |
| 3D Human Pose Estimation | EMDB | Average MVE-PA (mm) | 90.648 | ROMP |
| 3D Human Pose Estimation | EMDB | Jitter (10m/s^3) | 71.2556 | ROMP |
| 3D Human Pose Estimation | Panoptic | Average MPJPE (mm) | 127.6 | ROMP (ResNet-50) |
| 3D Human Pose Estimation | 3D Poses in the Wild Challenge | MPJPE | 81.76 | ROMP |
| 3D Human Pose Estimation | Relative Human | PCDR | 68.27 | ROMP |
| Pose Estimation | EMDB | Average MPJAE (deg) | 26.5975 | ROMP |
| Pose Estimation | EMDB | Average MPJAE-PA (deg) | 23.9901 | ROMP |
| Pose Estimation | EMDB | Average MPJPE (mm) | 112.652 | ROMP |
| Pose Estimation | EMDB | Average MPJPE-PA (mm) | 75.1869 | ROMP |
| Pose Estimation | EMDB | Average MVE (mm) | 134.863 | ROMP |
| Pose Estimation | EMDB | Average MVE-PA (mm) | 90.648 | ROMP |
| Pose Estimation | EMDB | Jitter (10m/s^3) | 71.2556 | ROMP |
| Pose Estimation | Panoptic | Average MPJPE (mm) | 127.6 | ROMP (ResNet-50) |
| Pose Estimation | 3D Poses in the Wild Challenge | MPJPE | 81.76 | ROMP |
| Pose Estimation | Relative Human | PCDR | 68.27 | ROMP |
| Pose Estimation | CrowdPose | mAP @0.5:0.95 | 58.6 | ROMP+CAR |
| Pose Estimation | CrowdPose | mAP @0.5:0.95 | 55.6 | ROMP |
| 3D | EMDB | Average MPJAE (deg) | 26.5975 | ROMP |
| 3D | EMDB | Average MPJAE-PA (deg) | 23.9901 | ROMP |
| 3D | EMDB | Average MPJPE (mm) | 112.652 | ROMP |
| 3D | EMDB | Average MPJPE-PA (mm) | 75.1869 | ROMP |
| 3D | EMDB | Average MVE (mm) | 134.863 | ROMP |
| 3D | EMDB | Average MVE-PA (mm) | 90.648 | ROMP |
| 3D | EMDB | Jitter (10m/s^3) | 71.2556 | ROMP |
| 3D | Panoptic | Average MPJPE (mm) | 127.6 | ROMP (ResNet-50) |
| 3D | 3D Poses in the Wild Challenge | MPJPE | 81.76 | ROMP |
| 3D | Relative Human | PCDR | 68.27 | ROMP |
| 3D | CrowdPose | mAP @0.5:0.95 | 58.6 | ROMP+CAR |
| 3D | CrowdPose | mAP @0.5:0.95 | 55.6 | ROMP |
| 3D | Relative Human | PCDR | 54.84 | ROMP |
| 3D | Relative Human | PCDR-Adult | 55.34 | ROMP |
| 3D | Relative Human | PCDR-Baby | 30.08 | ROMP |
| 3D | Relative Human | PCDR-Kid | 48.41 | ROMP |
| 3D | Relative Human | PCDR-Teen | 51.12 | ROMP |
| 3D | Relative Human | mPCDK | 0.866 | ROMP |
| 3D Multi-Person Pose Estimation | Relative Human | PCDR | 68.27 | ROMP |
| 3D Depth Estimation | Relative Human | PCDR | 54.84 | ROMP |
| 3D Depth Estimation | Relative Human | PCDR-Adult | 55.34 | ROMP |
| 3D Depth Estimation | Relative Human | PCDR-Baby | 30.08 | ROMP |
| 3D Depth Estimation | Relative Human | PCDR-Kid | 48.41 | ROMP |
| 3D Depth Estimation | Relative Human | PCDR-Teen | 51.12 | ROMP |
| 3D Depth Estimation | Relative Human | mPCDK | 0.866 | ROMP |
| Multi-Person Pose Estimation | CrowdPose | mAP @0.5:0.95 | 58.6 | ROMP+CAR |
| Multi-Person Pose Estimation | CrowdPose | mAP @0.5:0.95 | 55.6 | ROMP |
| 1 Image, 2*2 Stitchi | EMDB | Average MPJAE (deg) | 26.5975 | ROMP |
| 1 Image, 2*2 Stitchi | EMDB | Average MPJAE-PA (deg) | 23.9901 | ROMP |
| 1 Image, 2*2 Stitchi | EMDB | Average MPJPE (mm) | 112.652 | ROMP |
| 1 Image, 2*2 Stitchi | EMDB | Average MPJPE-PA (mm) | 75.1869 | ROMP |
| 1 Image, 2*2 Stitchi | EMDB | Average MVE (mm) | 134.863 | ROMP |
| 1 Image, 2*2 Stitchi | EMDB | Average MVE-PA (mm) | 90.648 | ROMP |
| 1 Image, 2*2 Stitchi | EMDB | Jitter (10m/s^3) | 71.2556 | ROMP |
| 1 Image, 2*2 Stitchi | Panoptic | Average MPJPE (mm) | 127.6 | ROMP (ResNet-50) |
| 1 Image, 2*2 Stitchi | 3D Poses in the Wild Challenge | MPJPE | 81.76 | ROMP |
| 1 Image, 2*2 Stitchi | Relative Human | PCDR | 68.27 | ROMP |
| 1 Image, 2*2 Stitchi | CrowdPose | mAP @0.5:0.95 | 58.6 | ROMP+CAR |
| 1 Image, 2*2 Stitchi | CrowdPose | mAP @0.5:0.95 | 55.6 | ROMP |