MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations

Benedikt Alkin, Lukas Miklautz, Sepp Hochreiter, Johannes Brandstetter

2024-02-15

Tasks: Self-Supervised Image Classification, Image Clustering, Semantic Segmentation, Contrastive Learning

Abstract

We introduce MIM (Masked Image Modeling)-Refiner, a contrastive learning boost for pre-trained MIM models. MIM-Refiner is motivated by the insight that strong representations within MIM models generally reside in intermediate layers. Accordingly, MIM-Refiner leverages multiple contrastive heads that are connected to different intermediate layers. In each head, a modified nearest neighbor objective constructs semantic clusters that capture semantic information, which improves performance on downstream tasks in both off-the-shelf and fine-tuning settings. The refinement process is short and simple, yet highly effective. Within a few epochs, we refine the features of MIM models from subpar to state-of-the-art off-the-shelf features. Refining a ViT-H, pre-trained with data2vec 2.0 on ImageNet-1K, sets a new state of the art in linear probing (84.7%) and low-shot classification among models that are pre-trained on ImageNet-1K. MIM-Refiner efficiently combines the advantages of MIM and instance discrimination (ID) objectives and compares favorably against previous state-of-the-art self-supervised learning (SSL) models on a variety of benchmarks such as low-shot classification, long-tailed classification, clustering, and semantic segmentation.
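The idea of attaching several contrastive heads to intermediate layers and training each with a nearest neighbor objective can be illustrated with a small NumPy sketch. This is not the authors' implementation: it uses a plain NNCLR-style nearest neighbor swap followed by InfoNCE rather than the modified objective mentioned in the abstract, and the linear heads, the layer indices, the queue sizes, the temperature, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def nn_info_nce(anchor, positive, queue, temperature=0.2):
    """NNCLR-style loss (assumed stand-in for the paper's modified objective).

    Each anchor embedding is replaced by its nearest neighbor in `queue`,
    and that neighbor is contrasted against the second view of the batch.
    """
    anchor, positive, queue = map(l2_normalize, (anchor, positive, queue))
    nn_idx = np.argmax(anchor @ queue.T, axis=1)   # most similar queue entry
    neighbor = queue[nn_idx]                       # (B, P)
    logits = neighbor @ positive.T / temperature   # (B, B); diagonal = positives
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def multi_head_loss(feats_a, feats_b, heads, queues, temperature=0.2):
    """Average the loss over one hypothetical linear head per chosen layer.

    feats_a / feats_b map a layer index to (B, D) pooled features of two
    augmented views; heads maps the same index to a (D, P) projection.
    """
    losses = []
    for layer, weight in heads.items():
        z_a = feats_a[layer] @ weight
        z_b = feats_b[layer] @ weight
        losses.append(nn_info_nce(z_a, z_b, queues[layer], temperature))
    return float(np.mean(losses))


# Toy usage with random features standing in for encoder activations.
rng = np.random.default_rng(0)
B, D, P, Q = 8, 16, 4, 32
layers = [6, 9, 12]  # assumed intermediate block indices
heads = {l: rng.normal(size=(D, P)) for l in layers}
queues = {l: rng.normal(size=(Q, P)) for l in layers}
feats_a = {l: rng.normal(size=(B, D)) for l in layers}
feats_b = {l: f + 0.01 * rng.normal(size=(B, D)) for l, f in feats_a.items()}
loss = multi_head_loss(feats_a, feats_b, heads, queues)
```

In a real setup the features would come from a pre-trained MIM encoder and the heads and encoder would be updated by gradient descent; the sketch only shows how the per-layer losses are formed and combined.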

Results

Task	Dataset	Metric	Value	Model
Image Clustering	ImageNet	ARI	42.2	MIM-Refiner (D2V2-ViT-H/14)
Image Clustering	ImageNet	Accuracy	67.3	MIM-Refiner (D2V2-ViT-H/14)
Image Clustering	ImageNet	NMI	87.2	MIM-Refiner (D2V2-ViT-H/14)
Image Clustering	ImageNet	ARI	45.5	MIM-Refiner (MAE-ViT-H/14)
Image Clustering	ImageNet	Accuracy	64.6	MIM-Refiner (MAE-ViT-H/14)
Image Clustering	ImageNet	NMI	85.3	MIM-Refiner (MAE-ViT-H/14)
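For readers unfamiliar with the clustering metrics in the table, the following NumPy sketch computes the adjusted Rand index (ARI) and normalized mutual information (NMI) from two label vectors. It is an illustration under assumptions, not the evaluation code behind the reported numbers: NMI is normalized here by the arithmetic mean of the two entropies, which may differ from the convention used for the table, both functions return values on a 0 to 1 scale, and the function names are hypothetical. Clustering accuracy additionally requires an optimal one-to-one matching between clusters and classes and is omitted.

```python
import numpy as np


def contingency(labels_true, labels_pred):
    """Count matrix n[i, j]: samples with true class i in predicted cluster j."""
    _, t = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    n = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(n, (t, p), 1)
    return n


def comb2(x):
    """Number of unordered pairs, x choose 2, applied elementwise."""
    return x * (x - 1) / 2.0


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting agreement, corrected for chance."""
    n = contingency(labels_true, labels_pred)
    sum_ij = comb2(n).sum()
    sum_a = comb2(n.sum(axis=1)).sum()
    sum_b = comb2(n.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(n.sum())
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return float((sum_ij - expected) / (max_index - expected))


def normalized_mutual_info(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = contingency(labels_true, labels_pred).astype(np.float64)
    p_ij = n / n.sum()
    p_i = p_ij.sum(axis=1)
    p_j = p_ij.sum(axis=0)
    nz = p_ij > 0  # skip empty cells to avoid log(0)
    mi = np.sum(p_ij[nz] * np.log(p_ij[nz] / np.outer(p_i, p_j)[nz]))
    h_true = -np.sum(p_i * np.log(p_i))
    h_pred = -np.sum(p_j * np.log(p_j))
    denom = 0.5 * (h_true + h_pred)
    return float(mi / denom) if denom > 0 else 1.0
```

Both metrics are invariant to how cluster IDs are numbered, so a clustering that recovers the classes under permuted IDs still scores 1.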
