Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Rethinking and Improving Relative Position Encoding for Vision Transformer

Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, Hongyang Chao

2021-07-29 | ICCV 2021 10 | Image Classification, Object Detection

Abstract

Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., can relative position encoding work as well as absolute position encoding? To clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight, and they can be easily plugged into transformer blocks. Experiments demonstrate that, solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO, respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
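
To make the two ideas named in the abstract more concrete (directional relative distances on a 2D grid, and an interaction between queries and relative position embeddings), here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the simple clipping of offsets into buckets, and the single-head "bias on keys" formulation are all assumptions chosen for illustration; the official code at the URL above should be consulted for the actual method.

```python
import numpy as np

def relative_index_2d(height, width, max_offset):
    """Assign a bucket id to every (query, key) pair on a height x width grid.

    Assumption for illustration: offsets are clipped to [-max_offset, max_offset].
    The bucket depends on the signed (dy, dx), so it is directional:
    "key to the left" and "key to the right" land in different buckets.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)           # (N, 2)
    delta = coords[None, :, :] - coords[:, None, :]               # (N, N, 2), key minus query
    delta = np.clip(delta, -max_offset, max_offset) + max_offset  # shift into 0..2*max_offset
    side = 2 * max_offset + 1
    return delta[..., 0] * side + delta[..., 1]                   # (N, N) bucket ids

def attention_with_key_rpe(q, k, v, rel_table, rel_index):
    """Single-head attention whose logits add a query-dependent position term."""
    d = q.shape[-1]
    content = q @ k.T                                             # (N, N) content term
    q_dot_r = q @ rel_table.T                                     # (N, num_buckets)
    # For each pair (i, j), pick q_i . r_bucket(i, j).
    position = np.take_along_axis(q_dot_r, rel_index, axis=1)     # (N, N)
    logits = (content + position) / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy usage on a 3 x 4 grid of tokens with random (untrained) parameters.
rng = np.random.default_rng(0)
H, W, d, max_offset = 3, 4, 8, 2
N = H * W
rel_index = relative_index_2d(H, W, max_offset)
rel_table = rng.normal(size=((2 * max_offset + 1) ** 2, d))       # one embedding per bucket
q, k, v = (rng.normal(size=(N, d)) for _ in range(3))
out, weights = attention_with_key_rpe(q, k, v, rel_table, rel_index)
```

Because the position term is computed from the query, the same relative offset can contribute differently for different query tokens, in contrast to a fixed per-offset bias. Setting `rel_table` to zeros recovers plain scaled dot-product attention in this sketch.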

Results

Task                  Dataset       Metric  Value   Model
Object Detection      COCO minival  box AP  42.3    DETR-ResNet50 with iRPE-K (300 epochs)
Object Detection      COCO minival  box AP  40.8    DETR-ResNet50 with iRPE-K (150 epochs)
Image Classification  ImageNet      GFLOPs  35.368  DeiT-B with iRPE-K
Image Classification  ImageNet      GFLOPs  9.77    DeiT-S with iRPE-QKV
Image Classification  ImageNet      GFLOPs  9.412   DeiT-S with iRPE-QK
Image Classification  ImageNet      GFLOPs  9.318   DeiT-S with iRPE-K
Image Classification  ImageNet      GFLOPs  2.568   DeiT-Ti with iRPE-K

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)