Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong

2021-11-15

Tasks: Self-Supervised Image Classification, Image Classification, Masked Language Modeling, Semantic Segmentation, Instance Segmentation, Unsupervised Image Classification, Object Detection, Language Modelling, Semi-Supervised Image Classification

Abstract

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
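
The two self-distillation terms described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it handles a single view of a single image, omits details such as multiple augmented views, teacher-output centering, and how the teacher weights are updated, and the temperature values and all function names (`distill`, `ibot_style_loss`) are assumptions chosen for the example. The teacher's softened output for each masked patch plays the role of the "token" that the student must predict.

```python
import math


def softmax(logits, temperature):
    # Numerically stable softmax over temperature-scaled logits.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    # Numerically stable log-softmax, used for the student predictions.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - peak - log_total for s in scaled]


def distill(teacher_logits, student_logits, teacher_temp, student_temp):
    # Cross-entropy between the teacher distribution (the target)
    # and the student distribution (the prediction).
    targets = softmax(teacher_logits, teacher_temp)
    log_preds = log_softmax(student_logits, student_temp)
    return -sum(t * lp for t, lp in zip(targets, log_preds))


def ibot_style_loss(student_cls, teacher_cls, student_patches, teacher_patches,
                    mask, teacher_temp=0.04, student_temp=0.1):
    # Temperatures are illustrative assumptions, not values from the source.
    # Term 1: self-distillation on the class token.
    cls_loss = distill(teacher_cls, student_cls, teacher_temp, student_temp)

    # Term 2: self-distillation on patch tokens, only where the
    # student's input was masked; the teacher acts as the tokenizer.
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    patch_loss = 0.0
    if masked:
        patch_loss = sum(
            distill(teacher_patches[i], student_patches[i],
                    teacher_temp, student_temp)
            for i in masked
        ) / len(masked)

    return cls_loss + patch_loss


# Toy example: 3-dimensional outputs, two patches, first patch masked.
loss = ibot_style_loss(
    student_cls=[1.5, 0.2, -0.7],
    teacher_cls=[2.0, 0.0, -1.0],
    student_patches=[[0.8, 0.1, 0.0], [0.0, 1.0, 0.0]],
    teacher_patches=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    mask=[True, False],
)
print(loss)
```

Because only masked positions contribute to the patch term, changing the student's output at an unmasked patch leaves the loss unchanged in this sketch.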

Results

Task                  | Dataset       | Metric          | Value | Model
Semantic Segmentation | ADE20K        | Validation mIoU | 50    | iBOT (ViT-B/16)
Semantic Segmentation | ADE20K        | Validation mIoU | 45.4  | iBOT (ViT-S/16)
Semantic Segmentation | ADE20K        | Validation mIoU | 38.3  | iBOT (ViT-B/16) (linear head)
Object Detection      | COCO test-dev | box mAP         | 51.2  | iBOT (ViT-B/16)
Object Detection      | COCO test-dev | box mAP         | 49.4  | iBOT (ViT-S/16)
Image Classification  | ImageNet      | ARI             | 32.8  | iBOT (ViT-S/16)
Image Classification  | ImageNet      | Accuracy (%)    | 43.4  | iBOT (ViT-S/16)
Instance Segmentation | COCO test-dev | mask AP         | 44.2  | iBOT (ViT-B/16)
Instance Segmentation | COCO test-dev | mask AP         | 42.6  | iBOT (ViT-S/16)
