
Understanding The Robustness in Vision Transformers

Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng, Jose M. Alvarez

2022-04-26 · Image Classification · Domain Generalization · Semantic Segmentation · Object Detection
Paper · PDF · Code (official) · Code

Abstract

Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, there is still a lack of systematic understanding. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicates that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. Code is available at: https://github.com/NVlabs/FAN.
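For illustration, the central design idea described in the abstract, applying attention over the channel dimension rather than only over tokens, can be sketched in a few lines of PyTorch. This is a simplified, assumed formulation, not the authors' code; the exact channel self-attention used in FAN is in the official repository at https://github.com/NVlabs/FAN.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Illustrative sketch: self-attention across channels rather than tokens,
    # in the spirit of FAN's attentional channel processing. Shapes and the
    # exact formulation here are assumptions, not the official design.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (B, N, C) tokens
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)             # each: (B, heads, C/h, N)
        attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)   # (B, heads, C/h, C/h)
        attn = attn.softmax(dim=-1)                      # channel-to-channel weights
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)

x = torch.randn(2, 196, 64)            # 2 images, 14x14 patch tokens, 64 channels
out = ChannelAttention(dim=64)(x)      # output shape unchanged: (2, 196, 64)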

Results

Task | Dataset | Metric | Value | Model
Domain Adaptation | ImageNet-R | Top-1 Error Rate | 28.9 | FAN-L-Hybrid (IN-21K, 384)
Domain Adaptation | ImageNet-A | Top-1 Accuracy (%) | 74.5 | FAN-L-Hybrid (IN-21K, 384)
Domain Adaptation | ImageNet-C | Top-1 Accuracy (%) | 73.6 | FAN-L-Hybrid (IN-22K)
Domain Adaptation | ImageNet-C | mean Corruption Error (mCE) | 35.8 | FAN-L-Hybrid (IN-22K)
Domain Adaptation | ImageNet-C | Top-1 Accuracy (%) | 70.5 | FAN-B-Hybrid (IN-22K)
Domain Adaptation | ImageNet-C | mean Corruption Error (mCE) | 41 | FAN-B-Hybrid (IN-22K)
Domain Adaptation | ImageNet-C | Top-1 Accuracy (%) | 67.7 | FAN-L-Hybrid
Domain Adaptation | ImageNet-C | mean Corruption Error (mCE) | 43 | FAN-L-Hybrid
Semantic Segmentation | Cityscapes val | mIoU | 82.3 | FAN-L-Hybrid
Object Detection | COCO minival | box AP | 55.1 | FAN-L-Hybrid
Domain Generalization | ImageNet-R | Top-1 Error Rate | 28.9 | FAN-L-Hybrid (IN-21K, 384)
Domain Generalization | ImageNet-A | Top-1 Accuracy (%) | 74.5 | FAN-L-Hybrid (IN-21K, 384)
Domain Generalization | ImageNet-C | Top-1 Accuracy (%) | 73.6 | FAN-L-Hybrid (IN-22K)
Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 35.8 | FAN-L-Hybrid (IN-22K)
Domain Generalization | ImageNet-C | Top-1 Accuracy (%) | 70.5 | FAN-B-Hybrid (IN-22K)
Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 41 | FAN-B-Hybrid (IN-22K)
Domain Generalization | ImageNet-C | Top-1 Accuracy (%) | 67.7 | FAN-L-Hybrid
Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 43 | FAN-L-Hybrid
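Note on metrics: mCE on ImageNet-C (lower is better) follows Hendrycks & Dietterich (2019): a model's top-1 error on each corruption type, summed over the five severity levels, is normalized by AlexNet's error on the same corruption, and these normalized errors are averaged over all corruption types. A minimal sketch of that computation (the input dictionaries are illustrative placeholders):

def mce(model_err: dict, alexnet_err: dict) -> float:
    # model_err / alexnet_err: corruption name -> list of top-1 error rates
    # at severities 1..5. Each corruption's error is AlexNet-normalized,
    # then averaged across corruptions and reported as a percentage.
    ces = [sum(errs) / sum(alexnet_err[c]) for c, errs in model_err.items()]
    return 100.0 * sum(ces) / len(ces)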

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)