Yuhui Yuan, Rao Fu, Lang Huang, WeiHong Lin, Chao Zhang, Xilin Chen, Jingdong Wang
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer, which produces low-resolution representations and has high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet), along with local-window self-attention, which performs self-attention over small non-overlapping image windows, to improve memory and computation efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks; for example, HRFormer outperforms Swin Transformer by $1.3$ AP on COCO pose estimation with $50\%$ fewer parameters and $30\%$ fewer FLOPs. Code is available at: https://github.com/HRNet/HRFormer.
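To make the efficiency argument for local-window self-attention concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It splits a feature map into non-overlapping windows and counts query-key pairs for global versus windowed attention. The function names, the 56 x 56 map size, and the window size of 7 are illustrative assumptions and are not taken from the abstract.

```python
def partition_windows(height, width, window):
    """Group the pixel coordinates of a height x width map into
    non-overlapping window x window tiles (row-major tile order)."""
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([(r, c)
                          for r in range(top, top + window)
                          for c in range(left, left + window)])
    return tiles


def attention_pairs_global(height, width):
    """Query-key pairs when every position attends to every position."""
    n = height * width
    return n * n


def attention_pairs_windowed(height, width, window):
    """Query-key pairs when attention is restricted to each tile."""
    tiles = partition_windows(height, width, window)
    return sum(len(tile) ** 2 for tile in tiles)


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    h, w, k = 56, 56, 7
    print(attention_pairs_global(h, w))       # 9834496
    print(attention_pairs_windowed(h, w, k))  # 153664
```

Under these assumed sizes the windowed variant evaluates 64 times fewer query-key pairs, since the count drops from $(HW)^2$ to $HW \cdot K^2$. The sketch also shows why the windows are "disconnected": no pair spans two tiles, which is the gap the convolution in the FFN is described as bridging.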
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Pose Estimation | AIC | AP | 34.4 | HRFormer (HRFormer-B) |
| Pose Estimation | AIC | AP50 | 78.3 | HRFormer (HRFormer-B) |
| Pose Estimation | AIC | AP75 | 24.8 | HRFormer (HRFormer-B) |
| Pose Estimation | AIC | AR | 38.7 | HRFormer (HRFormer-B) |
| Pose Estimation | AIC | AR50 | 80.9 | HRFormer (HRFormer-B) |
| Pose Estimation | AIC | AP | 31.6 | HRFormer (HRFormer-S) |
| Pose Estimation | AIC | AP75 | 20.9 | HRFormer (HRFormer-S) |
| Pose Estimation | AIC | AR | 35.8 | HRFormer (HRFormer-S) |
| Pose Estimation | AIC | AR50 | 78 | HRFormer (HRFormer-S) |
| Pose Estimation | COCO test-dev | AP | 76.2 | HRFormer-B |
| Pose Estimation | COCO test-dev | AP50 | 92.7 | HRFormer-B |
| Pose Estimation | COCO test-dev | AP75 | 83.8 | HRFormer-B |
| Pose Estimation | COCO test-dev | APL | 82.3 | HRFormer-B |
| Pose Estimation | COCO test-dev | APM | 72.5 | HRFormer-B |
| Pose Estimation | COCO test-dev | AR | 81.2 | HRFormer-B |
| Pose Estimation | CrowdPose | AP Easy | 80 | HRFormer-B |
| Pose Estimation | CrowdPose | AP Hard | 62.4 | HRFormer-B |
| Pose Estimation | CrowdPose | AP Medium | 73.5 | HRFormer-B |
| Pose Estimation | CrowdPose | mAP @0.5:0.95 | 72.4 | HRFormer-B |
| Pose Estimation | OCHuman | AP50 | 81.4 | HRFormer-B |
| Pose Estimation | OCHuman | AP75 | 67.1 | HRFormer-B |
| Pose Estimation | OCHuman | Validation AP | 62.1 | HRFormer-B |
| Image Classification | ImageNet | GFLOPs | 13.7 | HRFormer-B |
| Image Classification | ImageNet | GFLOPs | 1.8 | HRFormer-T |
| Multi-Person Pose Estimation | CrowdPose | AP Easy | 80 | HRFormer-B |
| Multi-Person Pose Estimation | CrowdPose | AP Hard | 62.4 | HRFormer-B |
| Multi-Person Pose Estimation | CrowdPose | AP Medium | 73.5 | HRFormer-B |
| Multi-Person Pose Estimation | CrowdPose | mAP @0.5:0.95 | 72.4 | HRFormer-B |
| Multi-Person Pose Estimation | OCHuman | AP50 | 81.4 | HRFormer-B |
| Multi-Person Pose Estimation | OCHuman | AP75 | 67.1 | HRFormer-B |
| Multi-Person Pose Estimation | OCHuman | Validation AP | 62.1 | HRFormer-B |