Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vision Transformers Need Registers

Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski

Published: 2023-09-28
Tasks: Self-Supervised Image Classification, Object Discovery
Links: Paper · PDF · Code (official and community implementations)

Abstract

Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
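The fix the abstract describes — adding extra learnable "register" tokens to the input sequence so the model has dedicated slots for internal computation, then discarding them at the output — can be sketched as follows. This is an illustrative NumPy mock-up, not the authors' DINOv2 code; the function name, token ordering, and shapes are assumptions for the sketch.

```python
import numpy as np

def forward_with_registers(patch_tokens, cls_token, registers, encoder):
    """Concatenate [CLS], register, and patch tokens, encode, drop registers.

    patch_tokens: (num_patches, dim) embedded image patches
    cls_token:    (1, dim) learnable class token
    registers:    (num_registers, dim) learnable register tokens
    encoder:      callable mapping (seq_len, dim) -> (seq_len, dim)
    """
    num_registers = registers.shape[0]
    # Registers join the sequence only for the forward pass, giving the
    # model scratch slots so it need not hijack background patch tokens.
    seq = np.concatenate([cls_token, registers, patch_tokens], axis=0)
    out = encoder(seq)
    # Registers are discarded at the output: downstream code sees the
    # usual [CLS] + patch-token layout, now with cleaner feature maps.
    cls_out = out[0]
    patch_out = out[1 + num_registers:]
    return cls_out, patch_out

# Toy usage with an identity "encoder" standing in for the ViT blocks.
dim, n_patch, n_reg = 8, 16, 4
rng = np.random.default_rng(0)
cls_out, patch_out = forward_with_registers(
    rng.normal(size=(n_patch, dim)),
    rng.normal(size=(1, dim)),
    rng.normal(size=(n_reg, dim)),
    encoder=lambda x: x,
)
print(cls_out.shape, patch_out.shape)
```

At inference the register outputs are simply thrown away; only the training-time presence of the slots changes where high-norm activations end up.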

Results

Task                  Dataset    Metric           Value   Model
Image Classification  ImageNet   Top 1 Accuracy   87.1    DINOv2+reg (ViT-g/14)

Related Papers

When Does Pruning Benefit Vision Representations? (2025-07-02)
FORLA: Federated Object-centric Representation Learning with Slot Attention (2025-06-03)
Binding threshold units with artificial oscillatory neurons (2025-05-06)
Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning (2025-05-04)
Are We Done with Object-Centric Learning? (2025-04-09)
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning (2025-03-27)
xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion (2025-03-19)
OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection (2025-03-09)