Description
Co-Scale Conv-Attentional Image Transformer (CoaT) is a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other. Second, the conv-attentional mechanism is designed by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities.
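The two ideas in the conv-attentional module can be sketched numerically: factorized attention applies the softmax over the keys first, so the two matrix products cost linear rather than quadratic time in the number of tokens, and the relative position term is a depthwise convolution over the values gated by the queries. The sketch below is a simplified, hypothetical NumPy illustration of these shapes and operations, not the paper's implementation (which uses multi-head 2-D depthwise convolutions); the 3-tap kernel is an assumed stand-in for the learned position kernel.

```python
import numpy as np

def factorized_attention(Q, K, V):
    """Factorized attention: softmax over the N keys first, then two
    matmuls, so the cost is linear in sequence length N."""
    N, C = Q.shape
    K_soft = np.exp(K - K.max(axis=0, keepdims=True))
    K_soft /= K_soft.sum(axis=0, keepdims=True)   # column-wise softmax over keys
    context = K_soft.T @ V                        # (C, C) global context
    return (Q @ context) / np.sqrt(C)             # (N, C)

def conv_position_term(Q, V, kernel):
    """Convolution-like relative position term: depthwise 1-D conv over V,
    gated elementwise by Q (simplified stand-in for CoaT's 2-D version)."""
    N, C = V.shape
    EV = np.zeros_like(V)
    for c in range(C):                            # depthwise: one kernel per channel
        EV[:, c] = np.convolve(V[:, c], kernel, mode="same")
    return Q * EV

rng = np.random.default_rng(0)
N, C = 8, 4                                       # toy token count and channel dim
Q, K, V = (rng.standard_normal((N, C)) for _ in range(3))
kernel = np.array([0.25, 0.5, 0.25])              # hypothetical 3-tap position kernel
out = factorized_attention(Q, K, V) + conv_position_term(Q, V, kernel)
print(out.shape)  # (8, 4)
```

Note that the (C, C) context matrix is shared by all queries, which is what removes the N-by-N attention map and makes the module efficient at high resolutions.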
Papers Using This Method
- Enhance Mobile Agents Thinking Process Via Iterative Preference Learning (2025-05-18)
- What do Vision Transformers Learn? A Visual Exploration (2022-12-13)
- Exploring Advances in Transformers and CNN for Skin Lesion Diagnosis on Small Datasets (2022-05-30)
- Exploring and Improving Mobile Level Vision Transformers (2021-08-30)
- Co-Scale Conv-Attentional Image Transformers (2021-04-13)