Yizhou Wang, Yixuan Wu, Shixiang Tang, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang
Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as the metaverse and sports analysis. Recent work has increasingly focused on developing human-centric foundation models that can benefit a broad range of these tasks. While many such models have achieved success, they do not explore human-centric 3D and vision-language tasks, and they require task-specific finetuning. These limitations restrict their application to further downstream tasks and scenarios. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing the various task-specific heads into two general heads, one for discrete representations, e.g., language, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of the proposed method, which achieves state-of-the-art performance on 11 of them. The code is available at https://github.com/OpenGVLab/Hulk.
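The two-head idea described in the abstract can be illustrated with a small toy sketch. This is not the Hulk implementation: the class `TwoHeadDecoder`, its method names, the random linear weights, and all shapes are hypothetical and chosen only to show how one discrete head (token indices, as for language) and one continuous head (real-valued coordinates) can share the same features.

```python
import numpy as np


class TwoHeadDecoder:
    """Toy decoder with one discrete head and one continuous head.

    Hypothetical illustration only: names, shapes, and random weights
    are assumptions, not taken from the Hulk codebase.
    """

    def __init__(self, feat_dim, vocab_size, coord_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_discrete = rng.normal(scale=0.02, size=(feat_dim, vocab_size))
        self.w_continuous = rng.normal(scale=0.02, size=(feat_dim, coord_dim))

    def discrete_head(self, feats):
        """Map each feature vector to the index of its highest-scoring token."""
        logits = feats @ self.w_discrete  # (num_tokens, vocab_size)
        return logits.argmax(axis=-1)

    def continuous_head(self, feats):
        """Regress real-valued coordinates from each feature vector."""
        return feats @ self.w_continuous  # (num_tokens, coord_dim)

    def decode(self, feats, heads=("discrete", "continuous")):
        """Run only the requested heads on the shared features."""
        out = {}
        if "discrete" in heads:
            out["discrete"] = self.discrete_head(feats)
        if "continuous" in heads:
            out["continuous"] = self.continuous_head(feats)
        return out


decoder = TwoHeadDecoder(feat_dim=16, vocab_size=100, coord_dim=2)
feats = np.random.default_rng(1).normal(size=(5, 16))  # 5 output tokens

text_like = decoder.decode(feats, heads=("discrete",))        # token ids only
keypoint_like = decoder.decode(feats, heads=("continuous",))  # 2D coordinates only
both = decoder.decode(feats)                                  # both outputs
```

In this sketch, an output that is purely symbolic uses only the discrete head, an output that is purely geometric uses only the continuous head, and a task needing both simply requests both from the same features.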
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| 3D Human Pose Estimation | 3DPW | MPJPE | 66.3 | Hulk(ViT-L) |
| 3D Human Pose Estimation | 3DPW | MPVPE | 77.4 | Hulk(ViT-L) |
| 3D Human Pose Estimation | 3DPW | PA-MPJPE | 38.5 | Hulk(ViT-L) |
| 3D Human Pose Estimation | 3DPW | MPJPE | 67 | Hulk(ViT-B) |
| 3D Human Pose Estimation | 3DPW | MPVPE | 79.8 | Hulk(ViT-B) |
| 3D Human Pose Estimation | 3DPW | PA-MPJPE | 39.9 | Hulk(ViT-B) |
| Pose Estimation | COCO (Common Objects in Context) | AP | 78.7 | Hulk(Finetune, ViT-L) |
| Pose Estimation | COCO (Common Objects in Context) | AP | 77.5 | Hulk(Finetune, ViT-B) |
| Pose Estimation | AIC | AP | 37.1 | Hulk(Finetune, ViT-L) |
| Pose Estimation | AIC | AP | 35.6 | Hulk(Finetune, ViT-B) |
| Pedestrian Attribute Recognition | PA-100K | Accuracy | 88.97 | Hulk(Finetune, ViT-L) |
| Pedestrian Attribute Recognition | PA-100K | Accuracy | 87.85 | Hulk(Finetune, ViT-B) |
| Pedestrian Attribute Recognition | RAPv2 | Accuracy | 85.86 | Hulk(Finetune, ViT-L) |
| Pedestrian Attribute Recognition | RAPv2 | Accuracy | 85.26 | Hulk(Finetune, ViT-B) |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 94.3 | Hulk(Finetune, ViT-L) |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 94 | Hulk(Finetune, ViT-B) |
| Human Part Segmentation | Human3.6M | mIoU | 69.89 | Hulk(Finetune, ViT-L) |
| Human Part Segmentation | Human3.6M | mIoU | 68.56 | Hulk(Finetune, ViT-B) |
| Human Part Segmentation | CIHP | mIoU | 72.68 | Hulk(Finetune, ViT-L) |
| Human Part Segmentation | CIHP | mIoU | 71.26 | Hulk(Finetune, ViT-B) |
| Object Detection | CrowdHuman (full body) | AP | 93 | Hulk(Finetune, ViT-L) |
| Object Detection | CrowdHuman (full body) | mMR | 36.5 | Hulk(Finetune, ViT-L) |
| Object Detection | CrowdHuman (full body) | AP | 92.4 | Hulk(Finetune, ViT-B) |
| Object Detection | CrowdHuman (full body) | mMR | 40.7 | Hulk(Finetune, ViT-B) |
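
The 3DPW rows above report MPJPE and PA-MPJPE, where lower is better. As a rough guide to what these numbers measure, the sketch below uses the commonly used definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and PA-MPJPE is the same error after a Procrustes (similarity) alignment of the prediction to the ground truth. The function names are hypothetical, and details such as the joint set, root alignment, and units are assumptions that may differ from the evaluation code used for Hulk. MPVPE follows the same formula as MPJPE, applied to mesh vertices instead of joints.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Similarity-align pred (J, 3) to gt (J, 3): rotation, scale, translation."""
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_pred, gt - mu_gt
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    signs = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(signs) @ u.T
    scale = (s * signs).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_gt


def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction."""
    return mpjpe(procrustes_align(pred, gt), gt)


# Synthetic check: the prediction is the ground truth, rotated, scaled, shifted.
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))
angle = np.deg2rad(30.0)
rz = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
pred = 1.5 * gt @ rz.T + np.array([0.1, -0.2, 0.3])

raw_error = mpjpe(pred, gt)         # large: pose is misplaced in space
aligned_error = pa_mpjpe(pred, gt)  # near zero: the pose shape itself is exact
```

Because PA-MPJPE removes global rotation, scale, and translation, it isolates errors in the articulated pose, which is why it is always lower than or equal to MPJPE for the same prediction.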