Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, PVT v2 achieves comparable or better performance than recent works such as the Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
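To make the three designs concrete, below is a minimal PyTorch sketch of each component, written from the abstract's description rather than from the released code. The class names, the 7x7 pooling size in the linear attention, and the 3x3 depth-wise kernel in the feed-forward network are illustrative assumptions; consult the repository for the exact implementation.

```python
import torch
import torch.nn as nn


class OverlapPatchEmbed(nn.Module):
    """Design (2): overlapping patch embedding. A strided conv whose kernel
    is larger than its stride, so neighboring patches share pixels."""
    def __init__(self, in_ch=3, dim=64, patch=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, patch, stride, padding=patch // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.proj(x)                     # (B, C, H', W')
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, N, C) token sequence
        return self.norm(x), H, W


class LinearSRAttention(nn.Module):
    """Design (1): linear-complexity attention. Keys/values come from a
    fixed-size average-pooled feature map, so the attention cost grows
    linearly with the number of input tokens."""
    def __init__(self, dim, num_heads=8, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.pool = nn.AdaptiveAvgPool2d(pool_size)  # fixed 7x7 grid (assumed)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        hd = C // self.num_heads
        q = self.q(x).reshape(B, N, self.num_heads, hd).transpose(1, 2)
        # Pool the token grid to pool_size x pool_size before K/V projection.
        pooled = self.pool(x.transpose(1, 2).reshape(B, C, H, W))
        pooled = pooled.flatten(2).transpose(1, 2)         # (B, 49, C)
        kv = self.kv(pooled).reshape(B, -1, 2, self.num_heads, hd)
        k, v = kv.permute(2, 0, 3, 1, 4)                   # each (B, h, 49, hd)
        attn = (q @ k.transpose(-2, -1)) * hd ** -0.5       # (B, h, N, 49)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class ConvFFN(nn.Module):
    """Design (3): convolutional feed-forward network. A 3x3 depth-wise conv
    between the first linear layer and GELU injects local spatial context."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):
        x = self.fc1(x)                                  # (B, N, hidden)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)        # to spatial grid
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)                 # back to tokens
        return self.fc2(self.act(x))


if __name__ == "__main__":
    img = torch.randn(1, 3, 224, 224)
    tokens, H, W = OverlapPatchEmbed()(img)              # (1, 56*56, 64)
    tokens = LinearSRAttention(64)(tokens, H, W)         # linear-cost attention
    tokens = ConvFFN(64, 256)(tokens, H, W)              # conv feed-forward
    print(tokens.shape)                                  # torch.Size([1, 3136, 64])
```

The usage stub chains the three modules in the order they appear in one encoder stage: embed, attend, then feed forward; a full PVT v2 stacks several such stages at decreasing spatial resolution.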
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | COCO-O | Average mAP | 28.2 | PVTv2-B5 (Mask R-CNN) |
| Object Detection | COCO-O | Effective Robustness | 6.85 | PVTv2-B5 (Mask R-CNN) |
| Object Detection | COCO minival | AP50 | 69.5 | Sparse R-CNN (PVTv2-B2) |
| Object Detection | COCO minival | AP75 | 54.9 | Sparse R-CNN (PVTv2-B2) |
| Object Detection | COCO minival | box AP | 50.1 | Sparse R-CNN (PVTv2-B2) |
| Image Classification | ImageNet | GFLOPs | 11.8 | PVTv2-B4 |
| Image Classification | ImageNet | GFLOPs | 6.9 | PVTv2-B3 |
| Image Classification | ImageNet | GFLOPs | 4.0 | PVTv2-B2 |
| Image Classification | ImageNet | GFLOPs | 2.1 | PVTv2-B1 |
| Image Classification | ImageNet | GFLOPs | 0.6 | PVTv2-B0 |