Sheng Jin, Shuhuai Li, Tong Li, Wentao Liu, Chen Qian, Ping Luo
Human-centric perception (e.g., detection, segmentation, pose estimation, and attribute analysis) is a long-standing problem in computer vision. This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP). Our approach centers on learning a unified human query representation, denoted as Human Query, which captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios. Although different HCP tasks have been well studied individually, single-stage multi-task learning of HCP tasks has not been fully exploited in the literature due to the absence of a comprehensive benchmark dataset. To address this gap, we propose the COCO-UniHuman benchmark to enable model development and comprehensive evaluation. Experimental results demonstrate the proposed method's state-of-the-art performance among multi-task HCP models and its competitive performance compared to task-specific HCP models. Moreover, our experiments underscore Human Query's adaptability to new HCP tasks, thus demonstrating its robust generalization capability. Code and data are available at https://github.com/lishuhuai527/COCO-UniHuman.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Pose Estimation | OCHuman | Test AP | 45.6 | HQNet (ViT-L) |
| Pose Estimation | OCHuman | Test AP | 40 | HQNet (ResNet-50) |
| Instance Segmentation | OCHuman | AP | 31.1 | HQNet (ResNet-50) |