Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt

2022-03-10 · Image Classification · Domain Generalization · Unsupervised Domain Adaptation · Out-of-Distribution Generalization

Paper · PDF · Code (official)

Abstract

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.
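The abstract above describes the core recipe: average the weights of several models fine-tuned from the same pre-trained checkpoint, either uniformly or greedily (adding models in order of held-out accuracy only when the running average does not get worse). A minimal, framework-free sketch of both variants follows; parameters are modeled here as plain dicts of float lists rather than real tensors, and the names `uniform_soup`, `greedy_soup`, and `accuracy_fn` are illustrative assumptions, not the authors' code (the official implementation is at the GitHub link in the abstract):

```python
def uniform_soup(state_dicts):
    """Elementwise average of parameters across fine-tuned models.

    Each state dict maps a parameter name to a flat list of floats
    (a stand-in for a real framework state dict)."""
    n = len(state_dicts)
    return {
        name: [sum(sd[name][i] for sd in state_dicts) / n
               for i in range(len(state_dicts[0][name]))]
        for name in state_dicts[0]
    }

def greedy_soup(state_dicts, accuracy_fn):
    """Greedy soup: sort models by held-out accuracy, then add each to
    the running average only if held-out accuracy does not drop."""
    ranked = sorted(state_dicts, key=accuracy_fn, reverse=True)
    members = [ranked[0]]
    best = accuracy_fn(uniform_soup(members))
    for sd in ranked[1:]:
        candidate = uniform_soup(members + [sd])
        acc = accuracy_fn(candidate)
        if acc >= best:          # keep the model only if it helps (or ties)
            members.append(sd)
            best = acc
    return uniform_soup(members)
```

The key property, as the abstract notes, is that the soup is a single model: inference and memory cost are identical to one fine-tuned model, unlike a logit ensemble, which must run every member.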

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Domain Adaptation | ImageNet-R | Top-1 Error | 4.54 | Model soups (ViT-G/14)
Domain Adaptation | ImageNet-R | Top-1 Error Rate | 3.9 | Model soups (BASIC-L)
Domain Adaptation | ImageNet-R | Top-1 Error Rate | 4.54 | Model soups (ViT-G/14)
Domain Adaptation | ImageNet-A | Top-1 Accuracy (%) | 94.17 | Model soups (BASIC-L)
Domain Adaptation | ImageNet-A | Top-1 Accuracy (%) | 92.67 | Model soups (ViT-G/14)
Domain Adaptation | ImageNet-Sketch | Top-1 Accuracy | 77.18 | Model soups (BASIC-L)
Domain Adaptation | ImageNet-Sketch | Top-1 Accuracy | 74.24 | Model soups (ViT-G/14)
Image Classification | ImageNet V2 | Top-1 Accuracy | 84.63 | Model soups (BASIC-L)
Image Classification | ImageNet V2 | Top-1 Accuracy | 84.22 | Model soups (ViT-G/14)
Image Classification | ObjectNet | Top-1 Accuracy | 79.03 | Baseline (ViT-G/14)
Image Classification | ObjectNet | Top-1 Accuracy | 78.52 | Model soups (ViT-G/14)
Unsupervised Domain Adaptation | ImageNet-R | Top-1 Error | 4.54 | Model soups (ViT-G/14)
Domain Generalization | ImageNet-R | Top-1 Error Rate | 3.9 | Model soups (BASIC-L)
Domain Generalization | ImageNet-R | Top-1 Error Rate | 4.54 | Model soups (ViT-G/14)
Domain Generalization | ImageNet-A | Top-1 Accuracy (%) | 94.17 | Model soups (BASIC-L)
Domain Generalization | ImageNet-A | Top-1 Accuracy (%) | 92.67 | Model soups (ViT-G/14)
Domain Generalization | ImageNet-Sketch | Top-1 Accuracy | 77.18 | Model soups (BASIC-L)
Domain Generalization | ImageNet-Sketch | Top-1 Accuracy | 74.24 | Model soups (ViT-G/14)

Related Papers

- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling (2025-07-17)