Description
Conditional Positional Encoding, or CPE, is a type of positional encoding for vision transformers. Unlike fixed or learnable positional encodings, which are predefined and independent of the input tokens, CPE is generated dynamically and conditioned on the local neighborhood of each input token. As a result, CPE can generalize to input sequences longer than any seen during training, and it preserves the translation invariance desired in image classification. CPE is implemented with a Position Encoding Generator (PEG) and can be incorporated into existing Transformer architectures.
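A minimal NumPy sketch of the idea: the PEG in the paper is a depthwise convolution with zero padding applied to the token sequence after reshaping it back onto its 2D grid, and the result is added to the tokens as the positional encoding. The function name `peg` and the explicit loop implementation below are illustrative choices, not the authors' code; in practice this would be a `Conv2d` with `groups=C`.

```python
import numpy as np

def peg(tokens, H, W, weights):
    """Position Encoding Generator (PEG) sketch (illustrative, not official code).

    tokens : (N, C) token sequence with N = H * W (class token excluded)
    weights: (C, k, k) depthwise conv kernel; zero padding keeps the grid size
    Returns the tokens with the conditional positional encoding added.
    """
    N, C = tokens.shape
    assert N == H * W, "token count must match the 2D grid"
    k = weights.shape[1]
    p = k // 2
    # Reshape the 1D token sequence back onto its 2D image grid.
    x = tokens.reshape(H, W, C)
    padded = np.zeros((H + 2 * p, W + 2 * p, C))
    padded[p:H + p, p:W + p] = x
    # Depthwise convolution: each channel sees only its own k x k neighborhood,
    # so the encoding is conditioned on the token's local neighborhood.
    enc = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k]          # (k, k, C)
            enc[i, j] = np.einsum('ijc,cij->c', patch, weights)
    # The encoding is added to the tokens (a residual connection).
    return tokens + enc.reshape(N, C)
```

Because the kernel slides over a grid of any size, the same learned weights apply to sequences longer than those seen during training, and shifting the input shifts the encoding with it (away from the zero-padded borders), which is where the translation invariance comes from.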
Papers Using This Method
- WriteViT: Handwritten Text Generation with Vision Transformer (2025-05-19)
- CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding (2024-12-10)
- Serialized Point Mamba: A Serialized Point Cloud Mamba Segmentation Model (2024-07-17)
- Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis (2024-03-26)
- V4D: Voxel for 4D Novel View Synthesis (2022-05-28)
- Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021-04-28)
- Conditional Positional Encodings for Vision Transformers (2021-02-22)