Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DINO

self-DIstillation with NO labels

Computer Vision · Introduced 2021 · 208 papers
Source Paper

Description

DINO (self-distillation with no labels) is a self-supervised learning method that directly predicts the output of a teacher network - built with a momentum encoder - using a standard cross-entropy loss.

For simplicity, consider the case of a single pair of views (x_1, x_2). The model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters. The output of the teacher network is centered with a mean computed over the batch. Each network outputs a K-dimensional feature normalized with a temperature softmax over the feature dimension. Their similarity is then measured with a cross-entropy loss. A stop-gradient (sg) operator is applied to the teacher so that gradients propagate only through the student. The teacher parameters are updated with an exponential moving average (EMA) of the student parameters.
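The training step above can be sketched in a few functions. This is a minimal NumPy illustration, not the official PyTorch implementation; the function names and the default temperatures (0.1 for the student, 0.04 for the teacher, as in the DINO paper) are choices made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the sharpened teacher and the student distributions.

    student_out, teacher_out: (batch, K) network outputs for two views.
    center: (K,) running mean subtracted from the teacher logits to avoid collapse.
    The teacher side is treated as a constant (stop-gradient, "sg").
    """
    t = softmax((teacher_out - center) / tau_t)    # teacher: center, then sharpen
    log_s = np.log(softmax(student_out / tau_s))   # student: temperature softmax
    return float(-(t * log_s).sum(axis=-1).mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student's."""
    return [momentum * p_t + (1.0 - momentum) * p_s
            for p_t, p_s in zip(teacher_params, student_params)]

def update_center(center, teacher_out, momentum=0.9):
    """Update the running batch mean used to center the teacher outputs."""
    return momentum * center + (1.0 - momentum) * teacher_out.mean(axis=0)
```

In a full training loop, `dino_loss` would be backpropagated through the student only, after which `ema_update` and `update_center` refresh the teacher and the centering statistic; the low teacher temperature sharpens its distribution while centering flattens it, and balancing the two is what prevents mode collapse.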

Papers Using This Method

- Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion (2025-07-08)
- GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models (2025-06-30)
- Rethinking Random Masking in Self Distillation on ViT (2025-06-12)
- Textile Analysis for Recycling Automation using Transfer Learning and Zero-Shot Foundation Models (2025-06-06)
- Talk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation (2025-06-03)
- Attacking Attention of Foundation Models Disrupts Downstream Tasks (2025-06-03)
- DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models (2025-05-29)
- UP-SLAM: Adaptively Structured Gaussian SLAM with Uncertainty Prediction in Dynamic Environments (2025-05-28)
- Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift (2025-05-26)
- Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations (2025-05-24)
- SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification (2025-05-20)
- Guiding Diffusion with Deep Geometric Moments: Balancing Fidelity and Variation (2025-05-18)
- IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation (2025-05-15)
- BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation (2025-05-11)
- UniVLA: Learning to Act Anywhere with Task-centric Latent Actions (2025-05-09)
- DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception (2025-05-07)
- From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection (2025-05-06)
- Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction (2025-05-01)
- Automated Measurement of Eczema Severity with Self-Supervised Learning (2025-04-21)
- PromptHMR: Promptable Human Mesh Recovery (2025-04-08)