Conformer: Local Features Coupling Global Representations for Visual Recognition

Zhiliang Peng, Wei Huang, Shanzhi Gu, Lingxi Xie, YaoWei Wang, Jianbin Jiao, Qixiang Ye

2021-05-09ICCV 2021 10Image Classification Representation Learning Semantic Segmentation Instance Segmentation object-detection Object Detection

Paper PDF Code Code Code(official)Code

Abstract

Within Convolutional Neural Network (CNN), the convolution operations are good at extracting local features but experience difficulty to capture global representations. Within visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer roots in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under the comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAPs for object detection and instance segmentation, respectively, demonstrating the great potential to be a general backbone network. Code is available at https://github.com/pengzhiliang/Conformer.

Results

Task	Dataset	Metric	Value	Model
Image Classification	ImageNet	GFLOPs	46.6	Conformer-B

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction2025-07-21 Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper2025-07-20 Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations2025-07-18 Adversarial attacks to image classification systems using evolutionary algorithms2025-07-17 Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy2025-07-17 Federated Learning for Commercial Image Sources2025-07-17 MUPAX: Multidimensional Problem Agnostic eXplainable AI2025-07-17 Spectral Bellman Method: Unifying Representation and Exploration in RL2025-07-17