
Fully Attentional Networks with Self-emerging Token Labeling

Bingyin Zhao, Zhiding Yu, Shiyi Lan, Yutao Cheng, Anima Anandkumar, Yingjie Lao, Jose M. Alvarez

2024-01-08 · ICCV 2023 · Semantic Segmentation
Paper · PDF · Code (official)

Abstract

Recent studies indicate that Vision Transformers (ViTs) are robust against out-of-distribution scenarios. In particular, the Fully Attentional Network (FAN), a family of ViT backbones, has achieved state-of-the-art robustness. In this paper, we revisit the FAN models and improve their pre-training with a self-emerging token labeling (STL) framework. Our method adopts a two-stage training framework. Specifically, we first train a FAN token labeler (FAN-TL) to generate semantically meaningful patch token labels, followed by a FAN student model training stage that uses both the token labels and the original class label. With the proposed STL framework, our best model based on FAN-L-Hybrid (77.3M parameters) achieves 84.8% Top-1 accuracy and 42.1% mCE on ImageNet-1K and ImageNet-C, and sets a new state-of-the-art for ImageNet-A (46.1%) and ImageNet-R (56.6%) without using extra data, outperforming the original FAN counterpart by significant margins. The proposed framework also demonstrates significantly enhanced performance on downstream tasks such as semantic segmentation, with up to 1.7% improvement in robustness over the counterpart model. Code is available at https://github.com/NVlabs/STL.
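The sketch below illustrates the two-stage recipe the abstract describes: stage 1 trains a token labeler on the ordinary class labels, stage 2 trains a student supervised by both the class label and the labeler's per-patch predictions. It is a minimal illustration in plain PyTorch, not the authors' implementation: the module is a hypothetical stand-in for a FAN backbone (a single conv patch embedding instead of the real attention blocks), and the token-loss weight is an assumed value. The official code is at https://github.com/NVlabs/STL.

```python
# Minimal sketch of the two-stage STL recipe (illustrative stand-in, not the official code).
import torch
import torch.nn.functional as F


class FANWithTokenHead(torch.nn.Module):
    """Hypothetical stand-in for a FAN backbone with an image-level and a per-token head."""

    def __init__(self, num_classes=1000, dim=256):
        super().__init__()
        self.patch_embed = torch.nn.Conv2d(3, dim, kernel_size=16, stride=16)  # placeholder backbone
        self.cls_head = torch.nn.Linear(dim, num_classes)    # image-level logits
        self.token_head = torch.nn.Linear(dim, num_classes)  # per-patch-token logits

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        return self.cls_head(tokens.mean(dim=1)), self.token_head(tokens)


def stage1_step(labeler, images, labels, opt):
    """Stage 1: train the FAN token labeler (FAN-TL) with the class label only."""
    cls_logits, _ = labeler(images)
    loss = F.cross_entropy(cls_logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


def stage2_step(student, labeler, images, labels, opt, token_weight=0.5):
    """Stage 2: train the FAN student with the class label plus the labeler's token labels."""
    with torch.no_grad():
        _, teacher_tokens = labeler(images)
        token_targets = teacher_tokens.softmax(dim=-1)  # "self-emerging" soft labels per patch
    cls_logits, token_logits = student(images)
    cls_loss = F.cross_entropy(cls_logits, labels)
    token_loss = F.cross_entropy(token_logits.flatten(0, 1), token_targets.flatten(0, 1))
    loss = cls_loss + token_weight * token_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Toy usage on random data, just to show the two stages in sequence.
labeler, student = FANWithTokenHead(), FANWithTokenHead()
images, labels = torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2,))
stage1_step(labeler, images, labels, torch.optim.SGD(labeler.parameters(), lr=0.1))
stage2_step(student, labeler, images, labels, torch.optim.SGD(student.parameters(), lr=0.1))
```

In the paper the two stages use FAN-Hybrid backbones trained on ImageNet-1K; the sketch only shows how the image-level loss and the patch-token loss combine during the student stage.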

Results

Task | Dataset | Metric | Value | Model
Domain Adaptation | ImageNet-R | Top-1 Error Rate | 43.4 | FAN-L-Hybrid+STL
Domain Adaptation | ImageNet-A | Top-1 Accuracy (%) | 46.1 | FAN-L-Hybrid+STL
Domain Adaptation | ImageNet-C | Top-1 Accuracy (%) | 69.2 | FAN-L-Hybrid+STL
Domain Adaptation | ImageNet-C | mean Corruption Error (mCE) | 42.1 | FAN-L-Hybrid+STL
Semantic Segmentation | Cityscapes val | mIoU | 82.8 | FAN-L-Hybrid+STL
Domain Generalization | ImageNet-R | Top-1 Error Rate | 43.4 | FAN-L-Hybrid+STL
Domain Generalization | ImageNet-A | Top-1 Accuracy (%) | 46.1 | FAN-L-Hybrid+STL
Domain Generalization | ImageNet-C | Top-1 Accuracy (%) | 69.2 | FAN-L-Hybrid+STL
Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 42.1 | FAN-L-Hybrid+STL
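To read the ImageNet-C rows: mean Corruption Error (mCE) is the model's top-1 error under each corruption type, summed over the severity levels, normalized by a reference model's error (AlexNet in the standard benchmark), and then averaged over corruption types; lower is better. A minimal sketch of that computation, using placeholder error values rather than measured ones:

```python
# Sketch of the standard mCE computation; the error numbers below are placeholders.
def mce(model_err, reference_err):
    """model_err / reference_err: dict corruption -> list of top-1 errors, one per severity."""
    ratios = [
        sum(model_err[c]) / sum(reference_err[c])  # normalize by the reference model per corruption
        for c in model_err
    ]
    return 100.0 * sum(ratios) / len(ratios)       # average over corruption types, in percent


# Toy usage with two made-up corruption types.
model_err = {"gaussian_noise": [0.30, 0.35, 0.40, 0.45, 0.50],
             "fog":            [0.25, 0.30, 0.35, 0.40, 0.45]}
reference_err = {"gaussian_noise": [0.70, 0.80, 0.85, 0.90, 0.95],
                 "fog":            [0.60, 0.70, 0.75, 0.80, 0.85]}
print(f"mCE = {mce(model_err, reference_err):.1f}")  # lower is better
```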

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV (2025-07-15)