Rethinking and Improving Relative Position Encoding for Vision Transformer

Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, Hongyang Chao

2021-07-29 | ICCV 2021 10 | Image Classification, Object Detection

Abstract

Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. In computer vision, however, its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work as well as absolute position encoding. To clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight, and can be easily plugged into transformer blocks. Experiments demonstrate that, solely due to the proposed encoding methods, DeiT and DETR obtain stable improvements of up to 1.5% (top-1 Acc) and 1.3% (mAP) over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
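The two ingredients named in the abstract, a directional 2D relative distance and an interaction between queries and relative position embeddings, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the clipped-offset bucketing, and the single-head layout are assumptions chosen for brevity, and the actual iRPE mapping functions live in the linked repository.

```python
import numpy as np

def relative_index_2d(height, width, k):
    """Assumed bucketing: clip signed (dy, dx) offsets to [-k, k] and flatten
    them into one bucket id, so left/right and up/down get different ids."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (N, 2) patch coordinates
    delta = coords[None, :, :] - coords[:, None, :]       # (N, N, 2): key minus query
    delta = np.clip(delta, -k, k) + k                     # shift offsets into [0, 2k]
    return delta[..., 0] * (2 * k + 1) + delta[..., 1]    # (N, N) bucket ids

def attention_with_key_rpe(q, key, v, rel_table, rel_index):
    """Single-head attention whose logits add q_i . r_bucket(i, j),
    i.e. the position term depends on the query (a contextual bias)."""
    d = q.shape[-1]
    content = q @ key.T                                   # (N, N) content logits
    q_dot_r = q @ rel_table.T                             # (N, B) query vs. every bucket
    position = np.take_along_axis(q_dot_r, rel_index, axis=1)  # (N, N)
    logits = (content + position) / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)          # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Usage on a hypothetical 2x3 patch grid with 8-dimensional heads.
rng = np.random.default_rng(0)
H, W, K, D = 2, 3, 1, 8
rel_index = relative_index_2d(H, W, K)
rel_table = rng.normal(size=((2 * K + 1) ** 2, D))        # one embedding per bucket
q, key, v = (rng.normal(size=(H * W, D)) for _ in range(3))
out, weights = attention_with_key_rpe(q, key, v, rel_table, rel_index)
```

Because the bucket id is built from signed offsets, the pair (query left of key) and the pair (query right of key) look up different embeddings, which is what "directional" means in this sketch.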

Results

Task	Dataset	Metric	Value	Model
Object Detection	COCO minival	box AP	42.3	DETR-ResNet50 with iRPE-K (300 epochs)
Object Detection	COCO minival	box AP	40.8	DETR-ResNet50 with iRPE-K (150 epochs)
Image Classification	ImageNet	GFLOPs	35.368	DeiT-B with iRPE-K
Image Classification	ImageNet	GFLOPs	9.77	DeiT-S with iRPE-QKV
Image Classification	ImageNet	GFLOPs	9.412	DeiT-S with iRPE-QK
Image Classification	ImageNet	GFLOPs	9.318	DeiT-S with iRPE-K
Image Classification	ImageNet	GFLOPs	2.568	DeiT-Ti with iRPE-K
