Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Incorporating Convolution Designs into Visual Transformers

Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, Wei Wu

2021-03-22 · ICCV 2021 · Image Classification

Abstract

Motivated by the success of Transformers in natural language processing (NLP) tasks, several attempts (e.g., ViT and DeiT) have emerged to apply Transformers to the vision domain. However, pure Transformer architectures often require a large amount of training data or extra supervision to obtain performance comparable to convolutional neural networks (CNNs). To overcome these limitations, we analyze the potential drawbacks of directly borrowing Transformer architectures from NLP. We then propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of the straightforward tokenization of raw input images, we design an Image-to-Tokens (I2T) module that extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that promotes correlation among neighboring tokens in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) is attached at the top of the Transformer to utilize the multi-level representations. Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data or extra CNN teachers. In addition, CeiT models converge with 3× fewer training iterations, which can reduce the training cost significantly. (Code and models will be released upon acceptance.)
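The LeFF idea from the abstract — reshape the patch tokens back into a 2D grid, mix them locally with a depthwise convolution, then flatten again while the class token bypasses the layer — can be sketched compactly. The following is a minimal NumPy illustration, not the authors' implementation: the dimension choices, the tanh GELU approximation, and the omission of batch normalization and biases are all assumptions for brevity.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; the paper does not specify the variant)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def depthwise_conv3x3(x, w):
    # x: (C, H, W), w: (C, 3, 3); each channel is convolved with its own 3x3
    # kernel (zero padding 1), i.e. a depthwise convolution
    C, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = np.sum(padded[:, i:i + 3, j:j + 3] * w, axis=(1, 2))
    return out

def leff(tokens, W1, W2, w_dw, spatial):
    """Locally-enhanced Feed-Forward sketch.

    tokens: (N+1, dim) with the class token first
    W1: (dim, hidden) expansion, W2: (hidden, dim) projection back
    w_dw: (hidden, 3, 3) depthwise kernels; spatial*spatial == N
    """
    cls, patches = tokens[:1], tokens[1:]          # class token skips the layer
    h = gelu(patches @ W1)                         # expand channels
    hidden = W1.shape[1]
    h2d = h.T.reshape(hidden, spatial, spatial)    # restore 2D patch grid
    h2d = gelu(depthwise_conv3x3(h2d, w_dw))       # local spatial mixing
    h = h2d.reshape(hidden, -1).T                  # flatten back to tokens
    patches = h @ W2                               # project back to dim
    return np.concatenate([cls, patches], axis=0)
```

For a DeiT-T-like configuration (196 patch tokens of width 192, so a 14×14 grid), the layer maps a `(197, 192)` token matrix to the same shape, and the class token passes through unchanged.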

Results

Task | Dataset | Metric | Value | Model
Image Classification | Stanford Cars | Accuracy | 94.1 | CeiT-S (384 finetune resolution)
Image Classification | Stanford Cars | Accuracy | 93.2 | CeiT-S
Image Classification | Stanford Cars | Accuracy | 93 | CeiT-T (384 finetune resolution)
Image Classification | Stanford Cars | Accuracy | 90.5 | CeiT-T
Image Classification | CIFAR-10 | Percentage correct | 99.1 | CeiT-S (384 finetune resolution)
Image Classification | CIFAR-10 | Percentage correct | 99 | CeiT-S
Image Classification | CIFAR-10 | Percentage correct | 98.5 | CeiT-T
Image Classification | Oxford-IIIT Pets | Accuracy | 94.9 | CeiT-S (384 finetune resolution)
Image Classification | Oxford-IIIT Pets | Accuracy | 94.6 | CeiT-S
Image Classification | Oxford-IIIT Pets | Accuracy | 94.5 | CeiT-T (384 finetune resolution)
Image Classification | Oxford-IIIT Pets | Accuracy | 93.8 | CeiT-T
Image Classification | Flowers-102 | Accuracy | 98.6 | CeiT-S (384 finetune resolution)
Image Classification | Flowers-102 | Accuracy | 98.2 | CeiT-S
Image Classification | Flowers-102 | Accuracy | 97.8 | CeiT-T (384 finetune resolution)
Image Classification | Flowers-102 | Accuracy | 96.9 | CeiT-T
Image Classification | iNaturalist 2019 | Top-1 Accuracy | 82.7 | CeiT-S (384 finetune resolution)
Image Classification | iNaturalist 2019 | Top-1 Accuracy | 78.9 | CeiT-S
Image Classification | iNaturalist 2019 | Top-1 Accuracy | 77.9 | CeiT-T (384 finetune resolution)
Image Classification | iNaturalist 2019 | Top-1 Accuracy | 72.8 | CeiT-T
Image Classification | CIFAR-100 | Percentage correct | 91.8 | CeiT-S
Image Classification | CIFAR-100 | Percentage correct | 91.8 | CeiT-S (384 finetune resolution)
Image Classification | CIFAR-100 | Percentage correct | 89.4 | CeiT-T
Image Classification | CIFAR-100 | Percentage correct | 88 | CeiT-T (384 finetune resolution)
Image Classification | ImageNet | GFLOPs | 12.9 | CeiT-S (384 finetune resolution)
Image Classification | ImageNet | GFLOPs | 4.5 | CeiT-S
Image Classification | ImageNet | GFLOPs | 3.6 | CeiT-T (384 finetune resolution)
Image Classification | ImageNet | GFLOPs | 1.2 | CeiT-T

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)
Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks (2025-07-14)
FedGSCA: Medical Federated Learning with Global Sample Selector and Client Adaptive Adjuster under Label Noise (2025-07-13)