Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TransNeXt: Robust Foveal Visual Perception for Vision Transformers

Dai Shi

Published 2023-11-28 · CVPR 2024
Tasks: Image Classification, Domain Generalization, Semantic Segmentation, Classification, Object Detection
Links: Paper · PDF · Code (official)

Abstract

Due to the depth degradation effect in residual connections, many efficient Vision Transformer models that rely on stacking layers for information exchange often fail to form sufficient information mixing, leading to unnatural visual perception. To address this issue, in this paper, we propose Aggregated Attention, a biomimetic design-based token mixer that simulates biological foveal vision and continuous eye movement while enabling each token on the feature map to have a global perception. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which further diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. Our approach does not rely on stacking for information exchange, thus effectively avoiding depth degradation and achieving natural visual perception. Additionally, we propose Convolutional GLU, a channel mixer that bridges the gap between the GLU and SE mechanisms, which empowers each token to have channel attention based on its nearest-neighbor image features, enhancing local modeling capability and model robustness. We combine aggregated attention and convolutional GLU to create a new visual backbone called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves state-of-the-art performance across multiple model sizes. At a resolution of $224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of $384^2$, a COCO object detection mAP of 57.1, and an ADE20K semantic segmentation mIoU of 54.7.
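The Convolutional GLU described above gates each token's channels using context gathered from its nearest-neighbor pixels. The sketch below illustrates that gating idea in plain NumPy: a value branch, a gate branch passed through a 3x3 depthwise convolution and GELU, element-wise multiplication, then an output projection. All function names, shapes, and weight layouts here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv3x3(x, kernels):
    # x: (C, H, W); kernels: (C, 3, 3); zero padding, stride 1.
    # Each channel is convolved with its own kernel (no cross-channel mixing).
    C, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(padded[c, i:i + 3, j:j + 3] * kernels[c])
    return out

def conv_glu(x, w_value, w_gate, dw_kernels, w_out):
    """Convolutional-GLU-style channel mixer (illustrative sketch).

    x: (C, H, W) feature map; w_value, w_gate: (D, C) per-pixel projections
    (equivalent to 1x1 convolutions); dw_kernels: (D, 3, 3); w_out: (C, D).
    """
    value = np.einsum('dc,chw->dhw', w_value, x)
    gate = np.einsum('dc,chw->dhw', w_gate, x)
    # Gate branch gathers nearest-neighbor context via depthwise conv, then GELU,
    # giving each token a local-context-conditioned channel gate.
    gate = gelu(depthwise_conv3x3(gate, dw_kernels))
    # Element-wise gating of the value branch, then output projection.
    return np.einsum('cd,dhw->chw', w_out, value * gate)
```

A real implementation would use framework convolution primitives rather than explicit loops; the loops here only make the depthwise operation explicit.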

Results

Task | Dataset | Metric | Value | Model
Domain Adaptation | ImageNet-A | Top-1 accuracy (%) | 61.6 | TransNeXt-Base (IN-1K supervised, 384)
Domain Adaptation | ImageNet-A | Top-1 accuracy (%) | 58.3 | TransNeXt-Small (IN-1K supervised, 384)
Domain Adaptation | ImageNet-A | Top-1 accuracy (%) | 50.6 | TransNeXt-Base (IN-1K supervised, 224)
Domain Adaptation | ImageNet-A | Top-1 accuracy (%) | 47.1 | TransNeXt-Small (IN-1K supervised, 224)
Semantic Segmentation | ADE20K | Params (M) | 109 | TransNeXt-Base (IN-1K pretrain, Mask2Former, 512)
Semantic Segmentation | ADE20K | Validation mIoU | 54.7 | TransNeXt-Base (IN-1K pretrain, Mask2Former, 512)
Semantic Segmentation | ADE20K | Params (M) | 69 | TransNeXt-Small (IN-1K pretrain, Mask2Former, 512)
Semantic Segmentation | ADE20K | Validation mIoU | 54.1 | TransNeXt-Small (IN-1K pretrain, Mask2Former, 512)
Semantic Segmentation | ADE20K | Params (M) | 47.5 | TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512)
Semantic Segmentation | ADE20K | Validation mIoU | 53.4 | TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512)
Object Detection | COCO minival | box AP | 57.1 | TransNeXt-Base (IN-1K pretrain, DINO 1x)
Object Detection | COCO minival | box AP | 56.6 | TransNeXt-Small (IN-1K pretrain, DINO 1x)
Object Detection | COCO minival | box AP | 55.7 | TransNeXt-Tiny (IN-1K pretrain, DINO 1x)
Image Classification | ImageNet | GFLOPs | 56.3 | TransNeXt-Base (IN-1K supervised, 384)
Image Classification | ImageNet | GFLOPs | 32.1 | TransNeXt-Small (IN-1K supervised, 384)
Image Classification | ImageNet | GFLOPs | 10.3 | TransNeXt-Small (IN-1K supervised, 224)
Image Classification | ImageNet | GFLOPs | 5.7 | TransNeXt-Tiny (IN-1K supervised, 224)
Image Classification | ImageNet | GFLOPs | 2.7 | TransNeXt-Micro (IN-1K supervised, 224)
Domain Generalization | ImageNet-A | Top-1 accuracy (%) | 61.6 | TransNeXt-Base (IN-1K supervised, 384)
Domain Generalization | ImageNet-A | Top-1 accuracy (%) | 58.3 | TransNeXt-Small (IN-1K supervised, 384)
Domain Generalization | ImageNet-A | Top-1 accuracy (%) | 50.6 | TransNeXt-Base (IN-1K supervised, 224)
Domain Generalization | ImageNet-A | Top-1 accuracy (%) | 47.1 | TransNeXt-Small (IN-1K supervised, 224)

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)