Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, PVT v2 achieves comparable or better performance than recent works such as the Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
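To make the three designs concrete, below is a minimal PyTorch sketch of each component, written from the abstract's description rather than from the released code. The class names, the 7x7 pooling size in the linear attention, and the 3x3 depth-wise kernel in the feed-forward network are illustrative assumptions; consult the repository for the exact implementation.

```python
import torch
import torch.nn as nn


class OverlapPatchEmbed(nn.Module):
    """Design (2): overlapping patch embedding. A strided conv whose kernel
    is larger than its stride, so neighboring patches share pixels."""
    def __init__(self, in_ch=3, dim=64, patch=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, patch, stride, padding=patch // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.proj(x)                     # (B, C, H', W')
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, N, C) token sequence
        return self.norm(x), H, W


class LinearSRAttention(nn.Module):
    """Design (1): linear-complexity attention. Keys/values come from a
    fixed-size average-pooled feature map, so the attention cost grows
    linearly with the number of input tokens."""
    def __init__(self, dim, num_heads=8, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.pool = nn.AdaptiveAvgPool2d(pool_size)  # fixed 7x7 grid (assumed)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        hd = C // self.num_heads
        q = self.q(x).reshape(B, N, self.num_heads, hd).transpose(1, 2)
        # Pool the token grid to pool_size x pool_size before K/V projection.
        pooled = self.pool(x.transpose(1, 2).reshape(B, C, H, W))
        pooled = pooled.flatten(2).transpose(1, 2)         # (B, 49, C)
        kv = self.kv(pooled).reshape(B, -1, 2, self.num_heads, hd)
        k, v = kv.permute(2, 0, 3, 1, 4)                   # each (B, h, 49, hd)
        attn = (q @ k.transpose(-2, -1)) * hd ** -0.5       # (B, h, N, 49)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class ConvFFN(nn.Module):
    """Design (3): convolutional feed-forward network. A 3x3 depth-wise conv
    between the first linear layer and GELU injects local spatial context."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):
        x = self.fc1(x)                                  # (B, N, hidden)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)        # to spatial grid
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)                 # back to tokens
        return self.fc2(self.act(x))


if __name__ == "__main__":
    img = torch.randn(1, 3, 224, 224)
    tokens, H, W = OverlapPatchEmbed()(img)              # (1, 56*56, 64)
    tokens = LinearSRAttention(64)(tokens, H, W)         # linear-cost attention
    tokens = ConvFFN(64, 256)(tokens, H, W)              # conv feed-forward
    print(tokens.shape)                                  # torch.Size([1, 3136, 64])
```

The usage stub chains the three modules in the order they appear in one encoder stage: embed, attend, then feed forward; a full PVT v2 stacks several such stages at decreasing spatial resolution.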
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | COCO-O | Average mAP | 28.2 | PVTv2-B5 (Mask R-CNN) |
| Object Detection | COCO-O | Effective Robustness | 6.85 | PVTv2-B5 (Mask R-CNN) |
| Object Detection | COCO minival | AP50 | 69.5 | Sparse R-CNN (PVTv2-B2) |
| Object Detection | COCO minival | AP75 | 54.9 | Sparse R-CNN (PVTv2-B2) |
| Object Detection | COCO minival | box AP | 50.1 | Sparse R-CNN (PVTv2-B2) |
| Image Classification | ImageNet | GFLOPs | 11.8 | PVTv2-B4 |
| Image Classification | ImageNet | GFLOPs | 6.9 | PVTv2-B3 |
| Image Classification | ImageNet | GFLOPs | 4.0 | PVTv2-B2 |
| Image Classification | ImageNet | GFLOPs | 2.1 | PVTv2-B1 |
| Image Classification | ImageNet | GFLOPs | 0.6 | PVTv2-B0 |