Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Could Giant Pretrained Image Models Extract Universal Representations?

Yutong Lin, Ze Liu, Zheng Zhang, Han Hu, Nanning Zheng, Stephen Lin, Yue Cao

Published: 2022-11-03
Tasks: Transfer Learning, Semantic Segmentation, Instance Segmentation, Action Recognition, Action Recognition In Videos, Temporal Action Localization, Object Detection

Abstract

Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.
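The frozen setting described in the abstract can be illustrated with a toy sketch: the backbone weights stay fixed and only a small task head is updated. This is not the paper's implementation (which uses SwinV2-G with task-specific heads); the tiny linear backbone, the synthetic data, and all names here (`backbone`, `head`, `train_head`, `W_FROZEN`) are hypothetical and chosen only to show the idea in plain Python.

```python
import random

def backbone(x, w_frozen):
    """Frozen feature extractor: a fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_frozen]

def head(features, w_head):
    """Trainable task head: a single linear output."""
    return sum(w * f for w, f in zip(w_head, features))

def mse(data, w_frozen, w_head):
    """Mean squared error of head(backbone(x)) against the targets."""
    return sum((head(backbone(x, w_frozen), w_head) - y) ** 2
               for x, y in data) / len(data)

def train_head(data, w_frozen, w_head, lr=0.02, epochs=200):
    """SGD on squared error that updates only the head; w_frozen is never written."""
    w_head = list(w_head)
    for _ in range(epochs):
        for x, y in data:
            feats = backbone(x, w_frozen)
            err = head(feats, w_head) - y
            w_head = [w - lr * err * f for w, f in zip(w_head, feats)]
    return w_head

# Toy setup (assumption): 3 input dims, 8 frozen features, 32 synthetic samples.
rng = random.Random(0)
W_FROZEN = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]
W_FROZEN_BEFORE = [row[:] for row in W_FROZEN]  # snapshot to check nothing changed

# Targets come from a hidden head on the frozen features, so the task is
# solvable by adapting the head alone.
hidden_head = [rng.uniform(-1, 1) for _ in range(8)]
inputs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(32)]
DATA = [(x, head(backbone(x, W_FROZEN), hidden_head)) for x in inputs]

w_init = [0.0] * 8
loss_before = mse(DATA, W_FROZEN, w_init)
w_trained = train_head(DATA, W_FROZEN, w_init)
loss_after = mse(DATA, W_FROZEN, w_trained)

print("backbone unchanged:", W_FROZEN == W_FROZEN_BEFORE)
print("loss before / after:", loss_before, loss_after)
```

In this sketch only 8 head weights are trained while the 24 backbone weights are shared and untouched, which mirrors, at a very small scale, the abstract's point that a frozen model leaves relatively few parameters for adapting to each downstream task.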

Results

Task | Dataset | Metric | Value | Model
Activity Recognition | Kinetics-400 | Top-1 Accuracy | 81.7 | Frozen Backbone, SwinV2-G-ext22K (Video-Swin)
Semantic Segmentation | ADE20K | Validation mIoU | 57.6 | Frozen Backbone, SwinV2-G-ext22K (Mask2Former)
Object Detection | COCO minival | box AP | 59.3 | Frozen Backbone, SwinV2-G-ext22K (HTC)
3D | COCO minival | box AP | 59.3 | Frozen Backbone, SwinV2-G-ext22K (HTC)
Instance Segmentation | COCO minival | mask AP | 51.6 | Frozen Backbone, SwinV2-G-ext22K (HTC)
Action Recognition | Kinetics-400 | Top-1 Accuracy | 81.7 | Frozen Backbone, SwinV2-G-ext22K (Video-Swin)
2D Classification | COCO minival | box AP | 59.3 | Frozen Backbone, SwinV2-G-ext22K (HTC)
2D Object Detection | COCO minival | box AP | 59.3 | Frozen Backbone, SwinV2-G-ext22K (HTC)
Action Recognition In Videos | Kinetics-400 | Top-1 Accuracy | 81.7 | Frozen Backbone, SwinV2-G-ext22K (Video-Swin)
10-shot image generation | ADE20K | Validation mIoU | 57.6 | Frozen Backbone, SwinV2-G-ext22K (Mask2Former)
16k | COCO minival | box AP | 59.3 | Frozen Backbone, SwinV2-G-ext22K (HTC)

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)