Yangtao Wang, Xi Shen, Shell Hu, Yuan Yuan, James Crowley, Dominique Vaufreydaz
Transformers trained with self-supervised learning using a self-distillation loss (DINO) have been shown to produce attention maps that highlight salient foreground objects. In this paper, we demonstrate a graph-based approach that uses the self-supervised transformer features to discover an object from an image. Visual tokens are viewed as nodes in a weighted graph whose edges represent a connectivity score based on the similarity of tokens. Foreground objects can then be segmented using a normalized graph-cut to group self-similar regions. We solve the graph-cut problem using spectral clustering with generalized eigen-decomposition and show that the eigenvector associated with the second smallest eigenvalue provides a cutting solution, since its absolute value indicates the likelihood that a token belongs to a foreground object. Despite its simplicity, this approach significantly boosts the performance of unsupervised object discovery: we improve over the recent state-of-the-art method, LOST, by margins of 6.9%, 8.1%, and 8.1% on VOC07, VOC12, and COCO20K, respectively. The performance can be further improved by adding a second-stage class-agnostic detector (CAD). Our proposed method can be easily extended to unsupervised saliency detection and weakly supervised object detection. For unsupervised saliency detection, we improve IoU by 4.9%, 5.2%, and 12.9% on ECSSD, DUTS, and DUT-OMRON, respectively, compared to the previous state of the art. For weakly supervised object detection, we achieve competitive performance on CUB and ImageNet.
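
The bipartition step described above can be illustrated with a minimal sketch, assuming NumPy and SciPy. It is not the authors' released implementation: the function name `ncut_bipartition`, the similarity threshold `tau`, the floor weight `eps`, the binarised edge weights, and the rule of splitting at the eigenvector's mean are all assumptions made for illustration. Only the overall recipe (token graph, generalized eigen-decomposition, second smallest eigenvector, absolute value as a foreground cue) comes from the abstract.

```python
import numpy as np
from scipy.linalg import eigh


def ncut_bipartition(features, tau=0.2, eps=1e-5):
    """Split tokens into two groups with a normalized-cut relaxation.

    features: (N, d) array of token features (e.g. from a ViT).
    tau, eps: hypothetical threshold and floor weight (assumptions).
    Returns a boolean foreground mask and the second eigenvector.
    """
    # Cosine similarity between every pair of tokens.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T

    # Assumed edge rule: strong edge if similar, tiny edge otherwise,
    # so the graph stays connected.
    w = np.where(sim > tau, 1.0, eps)
    d = np.diag(w.sum(axis=1))

    # Generalized eigenproblem (D - W) v = lambda * D v.
    # scipy returns eigenvalues in ascending order.
    _, eigvecs = eigh(d - w, d)
    second = eigvecs[:, 1]

    # Assumed split rule: threshold at the mean of the eigenvector.
    mask = second > second.mean()

    # Treat the side holding the largest |value| as foreground.
    seed = np.argmax(np.abs(second))
    if not mask[seed]:
        mask = ~mask
    return mask, second
```

In practice `features` would hold one row per image patch, and the returned mask can be reshaped to the patch grid to obtain a coarse object segmentation.
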
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Saliency Detection | ECSSD | Accuracy | 93.4 | TokenCut |
| Saliency Detection | ECSSD | IoU | 77.2 | TokenCut |
| Saliency Detection | ECSSD | maximal F-measure | 87.4 | TokenCut |
| Saliency Detection | DUT-OMRON | Accuracy | 89.7 | TokenCut |
| Saliency Detection | DUT-OMRON | IoU | 61.8 | TokenCut |
| Saliency Detection | DUT-OMRON | maximal F-measure | 69.7 | TokenCut |
| Saliency Detection | DUTS | Accuracy | 91.4 | TokenCut |
| Saliency Detection | DUTS | IoU | 62.4 | TokenCut |
| Saliency Detection | DUTS | maximal F-measure | 75.5 | TokenCut |
| Object Localization | ImageNet | GT-known localization accuracy | 65.4 | TokenCut |
| Object Localization | ImageNet | Top-1 Localization Accuracy | 52.3 | TokenCut |
| Object Localization | CUB | Top-1 Localization Accuracy | 72.9 | TokenCut |
| Object Localization | CUB-200-2011 | Top-1 Localization Accuracy | 72.9 | TokenCut |
| Single-object discovery | COCO_20k | CorLoc | 62.6 | TokenCut + CAD |
| Single-object discovery | COCO_20k | CorLoc | 58.8 | TokenCut |