
Fully Attentional Networks with Self-emerging Token Labeling

Bingyin Zhao, Zhiding Yu, Shiyi Lan, Yutao Cheng, Anima Anandkumar, Yingjie Lao, Jose M. Alvarez

2024-01-08 · ICCV 2023 · Semantic Segmentation
Paper · PDF · Code (official)

Abstract

Recent studies indicate that Vision Transformers (ViTs) are robust against out-of-distribution scenarios. In particular, the Fully Attentional Network (FAN), a family of ViT backbones, has achieved state-of-the-art robustness. In this paper, we revisit the FAN models and improve their pre-training with a self-emerging token labeling (STL) framework. Our method adopts a two-stage training framework. Specifically, we first train a FAN token labeler (FAN-TL) to generate semantically meaningful patch token labels, followed by a FAN student model training stage that uses both the token labels and the original class label. With the proposed STL framework, our best model based on FAN-L-Hybrid (77.3M parameters) achieves 84.8% Top-1 accuracy and 42.1% mCE on ImageNet-1K and ImageNet-C, and sets a new state-of-the-art for ImageNet-A (46.1%) and ImageNet-R (56.6%) without using extra data, outperforming the original FAN counterpart by significant margins. The proposed framework also demonstrates significantly enhanced performance on downstream tasks such as semantic segmentation, with up to 1.7% improvement in robustness over the counterpart model. Code is available at https://github.com/NVlabs/STL.
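The sketch below illustrates the two-stage recipe the abstract describes: stage 1 trains a token labeler on the ordinary class labels, stage 2 trains a student supervised by both the class label and the labeler's per-patch predictions. It is a minimal illustration in plain PyTorch, not the authors' implementation: the module is a hypothetical stand-in for a FAN backbone (a single conv patch embedding instead of the real attention blocks), and the token-loss weight is an assumed value. The official code is at https://github.com/NVlabs/STL.

```python
# Minimal sketch of the two-stage STL recipe (illustrative stand-in, not the official code).
import torch
import torch.nn.functional as F


class FANWithTokenHead(torch.nn.Module):
    """Hypothetical stand-in for a FAN backbone with an image-level and a per-token head."""

    def __init__(self, num_classes=1000, dim=256):
        super().__init__()
        self.patch_embed = torch.nn.Conv2d(3, dim, kernel_size=16, stride=16)  # placeholder backbone
        self.cls_head = torch.nn.Linear(dim, num_classes)    # image-level logits
        self.token_head = torch.nn.Linear(dim, num_classes)  # per-patch-token logits

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        return self.cls_head(tokens.mean(dim=1)), self.token_head(tokens)


def stage1_step(labeler, images, labels, opt):
    """Stage 1: train the FAN token labeler (FAN-TL) with the class label only."""
    cls_logits, _ = labeler(images)
    loss = F.cross_entropy(cls_logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


def stage2_step(student, labeler, images, labels, opt, token_weight=0.5):
    """Stage 2: train the FAN student with the class label plus the labeler's token labels."""
    with torch.no_grad():
        _, teacher_tokens = labeler(images)
        token_targets = teacher_tokens.softmax(dim=-1)  # "self-emerging" soft labels per patch
    cls_logits, token_logits = student(images)
    cls_loss = F.cross_entropy(cls_logits, labels)
    token_loss = F.cross_entropy(token_logits.flatten(0, 1), token_targets.flatten(0, 1))
    loss = cls_loss + token_weight * token_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Toy usage on random data, just to show the two stages in sequence.
labeler, student = FANWithTokenHead(), FANWithTokenHead()
images, labels = torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2,))
stage1_step(labeler, images, labels, torch.optim.SGD(labeler.parameters(), lr=0.1))
stage2_step(student, labeler, images, labels, torch.optim.SGD(student.parameters(), lr=0.1))
```

In the paper the two stages use FAN-Hybrid backbones trained on ImageNet-1K; the sketch only shows how the image-level loss and the patch-token loss combine during the student stage.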

Results

Task | Dataset | Metric | Value | Model
Domain Adaptation | ImageNet-R | Top-1 Error Rate | 43.4 | FAN-L-Hybrid+STL
Domain Adaptation | ImageNet-A | Top-1 Accuracy (%) | 46.1 | FAN-L-Hybrid+STL
Domain Adaptation | ImageNet-C | Top-1 Accuracy (%) | 69.2 | FAN-L-Hybrid+STL
Domain Adaptation | ImageNet-C | mean Corruption Error (mCE) | 42.1 | FAN-L-Hybrid+STL
Semantic Segmentation | Cityscapes val | mIoU | 82.8 | FAN-L-Hybrid+STL
Domain Generalization | ImageNet-R | Top-1 Error Rate | 43.4 | FAN-L-Hybrid+STL
Domain Generalization | ImageNet-A | Top-1 Accuracy (%) | 46.1 | FAN-L-Hybrid+STL
Domain Generalization | ImageNet-C | Top-1 Accuracy (%) | 69.2 | FAN-L-Hybrid+STL
Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 42.1 | FAN-L-Hybrid+STL
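To read the ImageNet-C rows: mean Corruption Error (mCE) is the model's top-1 error under each corruption type, summed over the severity levels, normalized by a reference model's error (AlexNet in the standard benchmark), and then averaged over corruption types; lower is better. A minimal sketch of that computation, using placeholder error values rather than measured ones:

```python
# Sketch of the standard mCE computation; the error numbers below are placeholders.
def mce(model_err, reference_err):
    """model_err / reference_err: dict corruption -> list of top-1 errors, one per severity."""
    ratios = [
        sum(model_err[c]) / sum(reference_err[c])  # normalize by the reference model per corruption
        for c in model_err
    ]
    return 100.0 * sum(ratios) / len(ratios)       # average over corruption types, in percent


# Toy usage with two made-up corruption types.
model_err = {"gaussian_noise": [0.30, 0.35, 0.40, 0.45, 0.50],
             "fog":            [0.25, 0.30, 0.35, 0.40, 0.45]}
reference_err = {"gaussian_noise": [0.70, 0.80, 0.85, 0.90, 0.95],
                 "fog":            [0.60, 0.70, 0.75, 0.80, 0.85]}
print(f"mCE = {mce(model_err, reference_err):.1f}")  # lower is better
```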

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV (2025-07-15)