Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vision Transformers Need Registers

Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski

Published: 2023-09-28
Tasks: Self-Supervised Image Classification, Object Discovery
Links: Paper · PDF · Code (official and community implementations)

Abstract

Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
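The fix the abstract describes — adding extra learnable "register" tokens to the input sequence so the model has dedicated slots for internal computation, then discarding them at the output — can be sketched as follows. This is an illustrative NumPy mock-up, not the authors' DINOv2 code; the function name, token ordering, and shapes are assumptions for the sketch.

```python
import numpy as np

def forward_with_registers(patch_tokens, cls_token, registers, encoder):
    """Concatenate [CLS], register, and patch tokens, encode, drop registers.

    patch_tokens: (num_patches, dim) embedded image patches
    cls_token:    (1, dim) learnable class token
    registers:    (num_registers, dim) learnable register tokens
    encoder:      callable mapping (seq_len, dim) -> (seq_len, dim)
    """
    num_registers = registers.shape[0]
    # Registers join the sequence only for the forward pass, giving the
    # model scratch slots so it need not hijack background patch tokens.
    seq = np.concatenate([cls_token, registers, patch_tokens], axis=0)
    out = encoder(seq)
    # Registers are discarded at the output: downstream code sees the
    # usual [CLS] + patch-token layout, now with cleaner feature maps.
    cls_out = out[0]
    patch_out = out[1 + num_registers:]
    return cls_out, patch_out

# Toy usage with an identity "encoder" standing in for the ViT blocks.
dim, n_patch, n_reg = 8, 16, 4
rng = np.random.default_rng(0)
cls_out, patch_out = forward_with_registers(
    rng.normal(size=(n_patch, dim)),
    rng.normal(size=(1, dim)),
    rng.normal(size=(n_reg, dim)),
    encoder=lambda x: x,
)
print(cls_out.shape, patch_out.shape)
```

At inference the register outputs are simply thrown away; only the training-time presence of the slots changes where high-norm activations end up.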

Results

Task                  Dataset    Metric           Value   Model
Image Classification  ImageNet   Top 1 Accuracy   87.1    DINOv2+reg (ViT-g/14)

Related Papers

When Does Pruning Benefit Vision Representations? (2025-07-02)
FORLA: Federated Object-centric Representation Learning with Slot Attention (2025-06-03)
Binding threshold units with artificial oscillatory neurons (2025-05-06)
Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning (2025-05-04)
Are We Done with Object-Centric Learning? (2025-04-09)
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning (2025-03-27)
xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion (2025-03-19)
OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection (2025-03-09)