Optimizing Relevance Maps of Vision Transformers Improves Robustness

Hila Chefer, Idan Schwartz, Lior Wolf

2022-06-02 · Image Classification · Out-of-Distribution Generalization

Abstract

It has been observed that visual classification models often rely mostly on the image background and neglect the foreground, which hurts their robustness to distribution changes. To alleviate this shortcoming, we propose to monitor the model's relevancy signal and manipulate it so that the model focuses on the foreground object. This is done as a finetuning step that involves relatively few samples, each consisting of an image and its associated foreground mask. Specifically, we encourage (i) the model's relevancy map to assign lower relevance to background regions, (ii) the relevancy map to consider as much information as possible from the foreground, and (iii) the model's decisions to have high confidence. When applied to Vision Transformer (ViT) models, a marked improvement in robustness to domain shifts is observed. Moreover, the foreground masks can be obtained automatically from a self-supervised variant of the ViT model itself, so no additional supervision is required.
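The three objectives in the abstract can be illustrated with a small loss sketch. This is not the authors' implementation. The squared-error form of the background and foreground terms, the use of the model's own top prediction for the confidence term, the weights `w_bg`, `w_fg`, `w_conf`, and the function name `relevance_finetune_loss` are all assumptions made for illustration. The relevance map is taken to be a flat list of values in [0, 1] and the mask a flat list of 0/1 labels.

```python
import math

def relevance_finetune_loss(relevance, fg_mask, logits,
                            w_bg=1.0, w_fg=1.0, w_conf=1.0):
    """Hypothetical three-term loss; every term's exact form is an assumption.

    relevance: flat list of per-patch relevance values in [0, 1]
    fg_mask:   flat list of 0/1 labels (1 = foreground)
    logits:    list of raw class scores from the classifier
    """
    bg = [r for r, m in zip(relevance, fg_mask) if m == 0]
    fg = [r for r, m in zip(relevance, fg_mask) if m == 1]

    # (i) background relevance should be low: penalize its squared magnitude.
    loss_bg = sum(r * r for r in bg) / max(len(bg), 1)

    # (ii) foreground relevance should be high: penalize distance from 1.
    loss_fg = sum((1.0 - r) ** 2 for r in fg) / max(len(fg), 1)

    # (iii) decisions should be confident: negative log-probability of the
    # model's own top class, computed with a numerically stable log-sum-exp.
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    loss_conf = log_z - top

    return w_bg * loss_bg + w_fg * loss_fg + w_conf * loss_conf

# A map that matches the mask, with a confident prediction, gives a loss near 0.
good = relevance_finetune_loss([1, 1, 0, 0], [1, 1, 0, 0], [20.0, 0.0, 0.0])
# An inverted map with a uniform prediction is penalized by all three terms.
bad = relevance_finetune_loss([0, 0, 1, 1], [1, 1, 0, 0], [0.0, 0.0, 0.0])
```

In an actual finetuning setup the relevance map would have to be differentiable with respect to the model parameters so that the first two terms can produce gradients. The plain-Python version here only shows how the terms combine.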

Results

Image Classification on ObjectNet (accuracy, %)

Model	Top-1 Accuracy	Top-5 Accuracy
AR-L (Opt Relevance)	52	73.5
AR-B (Opt Relevance)	47.1	70
AR-L	46.5	68.3
ViT-L (Opt Relevance)	43.2	65.8
ViT-B (Opt Relevance)	42.2	65.1
AR-B	41.4	63.7
AR-S (Opt Relevance)	39.3	61.7
ViT-L	37.4	59.5
DeiT-L (Opt Relevance)	36.3	56.6
ViT-B	35.1	56.4
AR-S	34.3	55.8
DeiT-S (Opt Relevance)	31.6	53
DeiT-L	31.4	48.5
DeiT-S	28.3	47.3
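
The size of the improvement can be read off the table by pairing each model with its "(Opt Relevance)" counterpart. The short sketch below does that arithmetic for Top-1 accuracy; the numbers are copied from the table, while the dictionary layout and the helper name `top1_gains` are illustrative choices.

```python
# Top-1 accuracy (%) on ObjectNet, copied from the results table above:
# model -> (baseline, with Opt Relevance).
top1 = {
    "AR-L":   (46.5, 52.0),
    "AR-B":   (41.4, 47.1),
    "AR-S":   (34.3, 39.3),
    "ViT-L":  (37.4, 43.2),
    "ViT-B":  (35.1, 42.2),
    "DeiT-L": (31.4, 36.3),
    "DeiT-S": (28.3, 31.6),
}

def top1_gains(table):
    """Return the absolute Top-1 gain in percentage points for each model."""
    return {model: round(opt - base, 1) for model, (base, opt) in table.items()}

gains = top1_gains(top1)

# List the models from largest to smallest gain.
for model, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{model}: +{gain} points")
```

Every listed model gains Top-1 accuracy after the finetuning step, with gains between 3.3 and 7.1 percentage points.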
