Could Giant Pretrained Image Models Extract Universal Representations?

Yutong Lin, Ze Liu, Zheng Zhang, Han Hu, Nanning Zheng, Stephen Lin, Yue Cao

2022-11-03

Transfer Learning, Semantic Segmentation, Instance Segmentation, Action Recognition, Action Recognition In Videos, Temporal Action Localization, Object Detection

Abstract

Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.

Results

Task	Dataset	Metric	Value	Model
Action Recognition	Kinetics-400	Top-1 Accuracy	81.7	Frozen Backbone, SwinV2-G-ext22K (Video-Swin)
Semantic Segmentation	ADE20K	Validation mIoU	57.6	Frozen Backbone, SwinV2-G-ext22K (Mask2Former)
Object Detection	COCO minival	box AP	59.3	Frozen Backbone, SwinV2-G-ext22K (HTC)
Instance Segmentation	COCO minival	mask AP	51.6	Frozen Backbone, SwinV2-G-ext22K (HTC)
