Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie

Published: 2023-01-02
Venue: CVPR 2023
Tasks: Representation Learning, Self-Supervised Learning, Semantic Segmentation, Object Detection

Abstract

Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
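
The abstract names the Global Response Normalization (GRN) layer but does not spell out its computation. The sketch below is a minimal NumPy illustration, assuming the three-step form commonly associated with GRN: a per-channel L2 norm over the spatial dimensions, division by the mean of those norms across channels, and a learnable scale and shift with a residual connection. The function name `global_response_norm`, the channels-last layout, and the `eps` value are assumptions made for illustration and are not taken from this page.

```python
import numpy as np

def global_response_norm(x, gamma, beta, eps=1e-6):
    """Sketch of a GRN-style layer (assumed formulation, not from this page).

    x:     feature map of shape (N, H, W, C), channels last
    gamma: learnable scale, broadcastable to (1, 1, 1, C)
    beta:  learnable shift, broadcastable to (1, 1, 1, C)
    """
    # 1) Global aggregation: L2 norm of each channel over H and W -> (N, 1, 1, C)
    gx = np.sqrt(np.sum(x ** 2, axis=(1, 2), keepdims=True))
    # 2) Divisive normalization: each channel's norm relative to the
    #    mean norm across channels, so channels compete with each other
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)
    # 3) Calibration: rescale the input, then add shift and residual
    return gamma * (x * nx) + beta + x

# Tiny example: two channels with different magnitudes
x = np.stack([np.ones((2, 2)), 3.0 * np.ones((2, 2))], axis=-1)[None]  # (1, 2, 2, 2)
identity = global_response_norm(x, gamma=np.zeros(2), beta=np.zeros(2))
scaled = global_response_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

With `gamma` and `beta` at zero the layer reduces to the identity, which is one reason such a layer can be inserted into an existing block without disturbing it at initialization. With `gamma` set to one, the stronger channel is amplified more than the weaker one.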

Results

Task                   Dataset  Metric           Value  Model
Semantic Segmentation  ADE20K   Validation mIoU  55     ConvNeXt V2-H (FCMAE)
Semantic Segmentation  ADE20K   Validation mIoU  54.2   Swin V2-H
Semantic Segmentation  ADE20K   Validation mIoU  53.7   ConvNeXt V2-L
Semantic Segmentation  ADE20K   Validation mIoU  53.5   Swin-L
Semantic Segmentation  ADE20K   Validation mIoU  52.8   Swin-B
Semantic Segmentation  ADE20K   Validation mIoU  52.1   ConvNeXt V2-B
Semantic Segmentation  ADE20K   Validation mIoU  51.6   ConvNeXt V2-L (Supervised)
Semantic Segmentation  ADE20K   Validation mIoU  50.5   ConvNeXt V1-L
Semantic Segmentation  ADE20K   Validation mIoU  49.9   ConvNeXt V1-B
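
The results above are reported as validation mIoU (mean intersection over union), which this page does not define. The following is a minimal sketch of the metric for a single predicted label map and its ground truth. The function name `mean_iou` and the choice to skip classes absent from both maps are assumptions. Benchmark evaluations typically accumulate intersections and unions over the whole validation set before averaging, which this sketch does not do.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for one pair of integer label maps (sketch)."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            # Class appears in neither map; leave it out of the average
            continue
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x2 example with two classes
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)  # (1/2 + 2/3) / 2
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12, or about 0.583. The table values are this kind of score expressed as a percentage.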
