Lianghui Zhu, Junwei Zhou, Yan Liu, Xin Hao, Wenyu Liu, Xinggang Wang
Weakly supervised visual recognition using inexact supervision is a critical yet challenging learning problem. It significantly reduces human labeling costs and traditionally relies on multi-instance learning and pseudo-labeling. This paper introduces WeakSAM, which tackles weakly-supervised object detection (WSOD) and weakly-supervised instance segmentation (WSIS) by utilizing the pre-learned world knowledge contained in a vision foundation model, i.e., the Segment Anything Model (SAM). WeakSAM addresses two critical limitations in traditional WSOD retraining, i.e., pseudo ground truth (PGT) incompleteness and noisy PGT instances, through adaptive PGT generation and Region of Interest (RoI) drop regularization. It also addresses two limitations of SAM for automatic object detection and segmentation: its need for prompts and its lack of category awareness. Our results indicate that WeakSAM surpasses previous state-of-the-art methods on WSOD and WSIS benchmarks by large margins, i.e., average improvements of 7.4% and 8.5%, respectively. The code is available at https://github.com/hustvl/WeakSAM.
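To make the idea of RoI drop regularization more concrete, here is a minimal sketch of one way noisy RoIs could be excluded from a training loss. The abstract does not specify the drop criterion, so the loss-threshold rule, the function names `roi_drop_mask` and `regularized_mean_loss`, and the example values are all assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def roi_drop_mask(losses: Sequence[float], threshold: float) -> List[bool]:
    """Return True for RoIs to keep: those whose loss does not exceed `threshold`.

    Assumption: an unusually high loss hints that the RoI was matched to a
    noisy pseudo ground truth (PGT) box, so it is dropped.
    """
    return [loss <= threshold for loss in losses]


def regularized_mean_loss(losses: Sequence[float], threshold: float) -> float:
    """Average the loss over kept RoIs only; return 0.0 if every RoI is dropped."""
    keep = roi_drop_mask(losses, threshold)
    kept = [loss for loss, k in zip(losses, keep) if k]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical per-RoI losses; the value 3.0 mimics an RoI matched to a noisy PGT box.
per_roi_losses = [0.2, 0.4, 3.0, 0.3]
print(roi_drop_mask(per_roi_losses, threshold=1.0))
print(regularized_mean_loss(per_roi_losses, threshold=1.0))
```

In this sketch the threshold is a fixed hyperparameter; a real detector would apply such a rule separately to its classification and regression losses, which is again an assumption rather than something stated in the abstract.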
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | MS-COCO-2014 | AP | 26.6 | WeakSAM-MIST-DINO (with SAM) |
| Object Detection | MS-COCO-2014 | AP | 24.9 | WeakSAM-OICR-DINO (with SAM) |
| Object Detection | MS-COCO-2014 | AP | 23.8 | WeakSAM-MIST-Faster RCNN (with SAM) |
| Object Detection | MS-COCO-2014 | AP | 22.9 | WeakSAM-MIST (with SAM) |
| Object Detection | MS-COCO-2014 | AP | 22.3 | WeakSAM-OICR-Faster RCNN (with SAM) |
| Object Detection | MS-COCO-2014 | AP | 19.9 | WeakSAM-OICR (with SAM) |
| Object Detection | PASCAL VOC 2007 | mAP | 73.4 | WeakSAM-MIST-DINO (with SAM) |
| Object Detection | PASCAL VOC 2007 | mAP | 71.8 | WeakSAM-MIST-Faster RCNN (with SAM) |
| Object Detection | PASCAL VOC 2007 | mAP | 67.4 | WeakSAM-MIST (with SAM) |
| Object Detection | PASCAL VOC 2007 | mAP | 66.1 | WeakSAM-OICR-DINO (with SAM) |
| Object Detection | PASCAL VOC 2007 | mAP | 65.7 | WeakSAM-OICR-Faster RCNN (with SAM) |
| Object Detection | PASCAL VOC 2007 | mAP | 58.9 | WeakSAM-OICR (with SAM) |
| Object Detection | PASCAL VOC 2012 test | mAP | 70.2 | WeakSAM-MIST-DINO (with SAM) |
| Object Detection | PASCAL VOC 2012 test | mAP | 69.2 | WeakSAM-MIST-Faster RCNN (with SAM) |
| Object Detection | PASCAL VOC 2012 test | mAP | 66.9 | WeakSAM-MIST (with SAM) |
| Object Detection | PASCAL VOC 2012 test | mAP | 63.7 | WeakSAM-OICR-DINO (with SAM) |
| Object Detection | PASCAL VOC 2012 test | mAP | 62.9 | WeakSAM-OICR-Faster RCNN (with SAM) |
| Object Detection | PASCAL VOC 2012 test | mAP | 58.4 | WeakSAM-OICR (with SAM) |
| Instance Segmentation | PASCAL VOC 2012 val | mAP@0.25 | 73.4 | WeakSAM-Mask2Former (with SAM) |
| Instance Segmentation | PASCAL VOC 2012 val | mAP@0.5 | 64.4 | WeakSAM-Mask2Former (with SAM) |
| Instance Segmentation | PASCAL VOC 2012 val | mAP@0.7 | 49.7 | WeakSAM-Mask2Former (with SAM) |
| Instance Segmentation | PASCAL VOC 2012 val | mAP@0.75 | 45.3 | WeakSAM-Mask2Former (with SAM) |
| Instance Segmentation | PASCAL VOC 2012 val | mAP@0.25 | 70.3 | WeakSAM-Mask RCNN (with SAM) |
| Instance Segmentation | PASCAL VOC 2012 val | mAP@0.5 | 59.6 | WeakSAM-Mask RCNN (with SAM) |
| Instance Segmentation | PASCAL VOC 2012 val | mAP@0.7 | 43.1 | WeakSAM-Mask RCNN (with SAM) |
| Instance Segmentation | PASCAL VOC 2012 val | mAP@0.75 | 36.2 | WeakSAM-Mask RCNN (with SAM) |
| Instance Segmentation | COCO 2017 val | AP | 25.2 | WeakSAM-Mask2Former (with SAM) |
| Instance Segmentation | COCO 2017 val | AP@50 | 38.4 | WeakSAM-Mask2Former (with SAM) |
| Instance Segmentation | COCO 2017 val | AP@75 | 27 | WeakSAM-Mask2Former (with SAM) |
| Instance Segmentation | COCO 2017 val | AP | 20.6 | WeakSAM-Mask RCNN (with SAM) |
| Instance Segmentation | COCO 2017 val | AP@50 | 33.9 | WeakSAM-Mask RCNN (with SAM) |
| Instance Segmentation | COCO 2017 val | AP@75 | 22 | WeakSAM-Mask RCNN (with SAM) |
| Instance Segmentation | COCO test-dev | AP | 25.9 | WeakSAM-Mask2Former (with SAM) |
| Instance Segmentation | COCO test-dev | AP@50 | 39.9 | WeakSAM-Mask2Former (with SAM) |
| Instance Segmentation | COCO test-dev | AP@75 | 27.9 | WeakSAM-Mask2Former (with SAM) |
| Instance Segmentation | COCO test-dev | AP | 21 | WeakSAM-Mask RCNN (with SAM) |
| Instance Segmentation | COCO test-dev | AP@50 | 34.5 | WeakSAM-Mask RCNN (with SAM) |
| Instance Segmentation | COCO test-dev | AP@75 | 22.2 | WeakSAM-Mask RCNN (with SAM) |