Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

A ConvNet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie

Published: 2022-01-10
Venue: CVPR 2022
Tasks: Image Classification, Domain Generalization, Real-Time Object Detection, Semantic Segmentation, Classification, Object Detection

Abstract

The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
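The 87.8% figure quoted in the abstract is a top-1 accuracy: the share of images whose highest-scoring class matches the ground-truth label. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `top1_accuracy` and the toy scores are assumptions made for illustration.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of samples whose highest-scoring class equals the label.

    `scores` holds one row of per-class scores per sample (hypothetical input layout).
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy data: 4 samples, 3 classes. Predictions are 1, 0, 2, 1.
scores = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
labels = [1, 0, 2, 2]

accuracy = top1_accuracy(scores, labels)   # 3 of 4 correct
error_rate = 100.0 - accuracy              # top-1 error rate is the complement
print(accuracy, error_rate)
```

Top-1 error rate is simply 100 minus top-1 accuracy, so a leaderboard may report either form of the same measurement.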

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Domain Adaptation / Domain Generalization | ImageNet-R | Top-1 Error Rate | 31.8 | ConvNeXt-XL (Im21k, 384) |
| Domain Adaptation / Domain Generalization | ImageNet-A | Top-1 accuracy % | 69.3 | ConvNeXt-XL (Im21k, 384) |
| Domain Adaptation / Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 38.8 | ConvNeXt-XL (Im21k) (augmentation overlap with ImageNet-C) |
| Domain Adaptation / Domain Generalization | VizWiz-Classification | Accuracy - All Images | 53.5 | ConvNeXt-B |
| Domain Adaptation / Domain Generalization | VizWiz-Classification | Accuracy - Clean Images | 56 | ConvNeXt-B |
| Domain Adaptation / Domain Generalization | VizWiz-Classification | Accuracy - Corrupted Images | 46.9 | ConvNeXt-B |
| Domain Adaptation / Domain Generalization | ImageNet-Sketch | Top-1 accuracy | 55 | ConvNeXt-XL (Im21k, 384) |
| Semantic Segmentation | ImageNet-S | mIoU (test) | 48.8 | ConvNext-Tiny (P4, 224x224, SUP) |
| Semantic Segmentation | ImageNet-S | mIoU (val) | 48.7 | ConvNext-Tiny (P4, 224x224, SUP) |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 3335 | ConvNeXt-XL++ |
| Semantic Segmentation | ADE20K | Params (M) | 391 | ConvNeXt-XL++ |
| Semantic Segmentation | ADE20K | Validation mIoU | 54 | ConvNeXt-XL++ |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 2458 | ConvNeXt-L++ |
| Semantic Segmentation | ADE20K | Params (M) | 235 | ConvNeXt-L++ |
| Semantic Segmentation | ADE20K | Validation mIoU | 53.7 | ConvNeXt-L++ |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 1828 | ConvNeXt-B++ |
| Semantic Segmentation | ADE20K | Params (M) | 122 | ConvNeXt-B++ |
| Semantic Segmentation | ADE20K | Validation mIoU | 53.1 | ConvNeXt-B++ |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 1170 | ConvNeXt-B |
| Semantic Segmentation | ADE20K | Params (M) | 122 | ConvNeXt-B |
| Semantic Segmentation | ADE20K | Validation mIoU | 49.9 | ConvNeXt-B |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 1027 | ConvNeXt-S |
| Semantic Segmentation | ADE20K | Params (M) | 82 | ConvNeXt-S |
| Semantic Segmentation | ADE20K | Validation mIoU | 49.6 | ConvNeXt-S |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 939 | ConvNeXt-T |
| Semantic Segmentation | ADE20K | Params (M) | 60 | ConvNeXt-T |
| Semantic Segmentation | ADE20K | Validation mIoU | 46.7 | ConvNeXt-T |
| Object Detection | COCO-O | Average mAP | 37.5 | ConvNeXt-XL (Cascade Mask R-CNN) |
| Object Detection | COCO-O | Effective Robustness | 12.68 | ConvNeXt-XL (Cascade Mask R-CNN) |
| Image Classification | ImageNet | GFLOPs | 179 | ConvNeXt-XL (ImageNet-22k) |
| Image Classification | ImageNet | GFLOPs | 101 | ConvNeXt-L (384 res) |
| Image Classification | ImageNet | GFLOPs | 4.5 | ConvNeXt-T |
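The segmentation results are reported as mIoU (mean intersection over union). As a rough illustration of how that metric is formed, here is a minimal sketch in plain Python over flattened label sequences. It is not the evaluation code behind the leaderboard numbers: the function name `mean_iou` is hypothetical, and real benchmark tooling typically accumulates intersections and unions over the whole dataset and handles an ignore label, which this sketch does not.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union across classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both sequences
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six "pixels", three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2.
miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * miou:.1f}%")
```

Averaging per class rather than per pixel means a rare class counts as much as a common one, which is why mIoU can differ sharply from plain pixel accuracy.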

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)