iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong

2021-11-15
Tasks: Self-Supervised Image Classification, Image Classification, Masked Language Modeling, Semantic Segmentation, Instance Segmentation, Unsupervised Image Classification, Object Detection, Language Modelling, Semi-Supervised Image Classification


Abstract

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
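
The masked-patch self-distillation described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the official implementation: the function names, the temperature values, and the momentum-averaged (EMA) teacher update are assumptions introduced here for clarity. The teacher sees the unmasked image and its softened patch outputs act as the "online tokens"; the student sees the masked image and is trained to match those targets at the masked positions only.

```python
import numpy as np

def softmax(x, temp=1.0):
    # Numerically stable softmax over the last axis with a temperature.
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(x, temp=1.0):
    # Numerically stable log-softmax over the last axis with a temperature.
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def masked_patch_loss(student_logits, teacher_logits, mask,
                      t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student patch distributions.

    student_logits: (N, K) patch outputs of the student on the MASKED image.
    teacher_logits: (N, K) patch outputs of the teacher on the UNMASKED image.
    mask:           (N,) boolean, True where the patch was masked.
    Temperatures are hypothetical defaults chosen for illustration.
    """
    targets = softmax(teacher_logits, t_teacher)    # online "tokens"
    log_probs = log_softmax(student_logits, t_student)
    per_patch = -(targets * log_probs).sum(axis=-1)  # shape (N,)
    n_masked = max(int(mask.sum()), 1)               # avoid division by zero
    return float((per_patch * mask).sum() / n_masked)

def ema_update(teacher_w, student_w, momentum=0.996):
    # Assumed teacher update: exponential moving average of student weights.
    return momentum * teacher_w + (1.0 - momentum) * student_w
```

Under the same assumptions, the class-token term mentioned in the abstract would apply the same cross-entropy to the single class-token output of each network, and the two terms would be summed. Because the teacher is derived from the student rather than trained separately, no tokenizer has to be pre-trained in an earlier stage.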

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	ADE20K	Validation mIoU	50	iBOT (ViT-B/16)
Semantic Segmentation	ADE20K	Validation mIoU	45.4	iBOT (ViT-S/16)
Semantic Segmentation	ADE20K	Validation mIoU	38.3	iBOT (ViT-B/16) (linear head)
Object Detection	COCO test-dev	box mAP	51.2	iBOT (ViT-B/16)
Object Detection	COCO test-dev	box mAP	49.4	iBOT (ViT-S/16)
Image Classification	ImageNet	ARI	32.8	iBOT (ViT-S/16)
Image Classification	ImageNet	Accuracy (%)	43.4	iBOT (ViT-S/16)
Instance Segmentation	COCO test-dev	mask AP	44.2	iBOT (ViT-B/16)
Instance Segmentation	COCO test-dev	mask AP	42.6	iBOT (ViT-S/16)
