Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DINO

self-DIstillation with NO labels

Computer Vision · Introduced 2021 · 208 papers
Source Paper

Description

DINO (self-distillation with no labels) is a self-supervised learning method that directly predicts the output of a teacher network - built with a momentum encoder - using a standard cross-entropy loss.

For simplicity, consider the case of a single pair of views (x_1, x_2). The model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters. The output of the teacher network is centered with a mean computed over the batch. Each network outputs a K-dimensional feature normalized with a temperature softmax over the feature dimension. Their similarity is then measured with a cross-entropy loss. A stop-gradient (sg) operator is applied to the teacher so that gradients propagate only through the student. The teacher parameters are updated with an exponential moving average (EMA) of the student parameters.
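The training step above can be sketched in a few functions. This is a minimal NumPy illustration, not the official PyTorch implementation; the function names and the default temperatures (0.1 for the student, 0.04 for the teacher, as in the DINO paper) are choices made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the sharpened teacher and the student distributions.

    student_out, teacher_out: (batch, K) network outputs for two views.
    center: (K,) running mean subtracted from the teacher logits to avoid collapse.
    The teacher side is treated as a constant (stop-gradient, "sg").
    """
    t = softmax((teacher_out - center) / tau_t)    # teacher: center, then sharpen
    log_s = np.log(softmax(student_out / tau_s))   # student: temperature softmax
    return float(-(t * log_s).sum(axis=-1).mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student's."""
    return [momentum * p_t + (1.0 - momentum) * p_s
            for p_t, p_s in zip(teacher_params, student_params)]

def update_center(center, teacher_out, momentum=0.9):
    """Update the running batch mean used to center the teacher outputs."""
    return momentum * center + (1.0 - momentum) * teacher_out.mean(axis=0)
```

In a full training loop, `dino_loss` would be backpropagated through the student only, after which `ema_update` and `update_center` refresh the teacher and the centering statistic; the low teacher temperature sharpens its distribution while centering flattens it, and balancing the two is what prevents mode collapse.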

Papers Using This Method

- Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion (2025-07-08)
- GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models (2025-06-30)
- Rethinking Random Masking in Self Distillation on ViT (2025-06-12)
- Textile Analysis for Recycling Automation using Transfer Learning and Zero-Shot Foundation Models (2025-06-06)
- Talk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation (2025-06-03)
- Attacking Attention of Foundation Models Disrupts Downstream Tasks (2025-06-03)
- DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models (2025-05-29)
- UP-SLAM: Adaptively Structured Gaussian SLAM with Uncertainty Prediction in Dynamic Environments (2025-05-28)
- Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift (2025-05-26)
- Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations (2025-05-24)
- SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification (2025-05-20)
- Guiding Diffusion with Deep Geometric Moments: Balancing Fidelity and Variation (2025-05-18)
- IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation (2025-05-15)
- BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation (2025-05-11)
- UniVLA: Learning to Act Anywhere with Task-centric Latent Actions (2025-05-09)
- DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception (2025-05-07)
- From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection (2025-05-06)
- Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction (2025-05-01)
- Automated Measurement of Eczema Severity with Self-Supervised Learning (2025-04-21)
- PromptHMR: Promptable Human Mesh Recovery (2025-04-08)