Xinyu Zhang, YuHan Liu, Yuting Wang, Abdeslam Boularias
Few-shot object detection aims to detect novel categories given only a few example images, a basic skill for a robot performing tasks in open environments. Recent methods focus on finetuning strategies, whose complicated procedures hinder wider application. In this paper, we introduce DE-ViT, a few-shot object detector that requires no finetuning. DE-ViT's architecture is based on a new region-propagation mechanism for localization. The propagated region masks are transformed into bounding boxes through a learnable spatial integral layer. Instead of training prototype classifiers, we use prototypes to project ViT features into a subspace that is robust to overfitting on base classes. We evaluate DE-ViT on few-shot and one-shot object detection benchmarks with Pascal VOC, COCO, and LVIS. DE-ViT establishes new state-of-the-art results on all benchmarks. Notably, on COCO, DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. On LVIS, DE-ViT outperforms the few-shot SoTA by 17 box APr. Further, we evaluate DE-ViT on a real robot by building a pick-and-place system that sorts novel objects based on example images. Videos of our robot demonstrations, the source code, and the models of DE-ViT can be found at https://mlzxy.github.io/devit.

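
The two mechanisms named in the abstract, projecting features with prototypes and turning region masks into boxes by spatial integration, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: `project_onto_prototypes` reads "projection" as cosine similarity against each prototype, and `mask_to_box` is a fixed, non-learnable stand-in for the learnable spatial integral layer, with hypothetical `lo` and `hi` thresholds.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)  # small epsilon avoids division by zero


def project_onto_prototypes(features, prototypes):
    """Replace each D-dim patch feature by its K similarities to K prototypes.

    Assumption: "projection" is read here as cosine similarity against each
    prototype; the paper's actual projection may differ.
    """
    return [[cosine(f, p) for p in prototypes] for f in features]


def mask_to_box(mask, lo=0.05, hi=0.95):
    """Turn a soft region mask (H x W, non-negative) into (x0, y0, x1, y1).

    Hypothetical stand-in for a spatial integral layer: the mask is summed
    along each axis, and the box edges are placed where the cumulative mass
    first reaches the `lo` and `hi` fractions.
    """
    total = sum(sum(row) for row in mask)
    if total <= 0:
        raise ValueError("mask has no mass")
    width = len(mask[0])
    col_mass = [sum(row[x] for row in mask) / total for x in range(width)]
    row_mass = [sum(row) / total for row in mask]

    def span(mass):
        cdf, start = 0.0, None
        for i, m in enumerate(mass):
            cdf += m
            if start is None and cdf >= lo:
                start = i
            if cdf >= hi:
                return start, i
        return start, len(mass) - 1  # reached only through rounding error

    (x0, x1), (y0, y1) = span(col_mass), span(row_mass)
    return x0, y0, x1, y1


# Toy inputs: three 2-D "patch features" and two class prototypes.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 2.0]]
projected = project_onto_prototypes(features, prototypes)

# Toy 4 x 5 mask containing a 2 x 3 block of ones.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(mask)
```

Each projected feature has one entry per prototype, so its dimensionality depends on the number of classes rather than on the backbone width. On the toy mask, the resulting box spans columns 1 to 3 and rows 1 to 2.
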
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | MS-COCO (30-shot) | AP | 34 | DE-ViT |
| Object Detection | MS-COCO (10-shot) | AP | 34 | DE-ViT |
| Object Detection | Artaxor | mAP | 49.2 | DE-ViT-FT |
| Object Detection | Artaxor | mAP | 9.2 | DE-ViT(w/o FT) |
| Object Detection | NEU-DET | mAP | 8.8 | DE-ViT-FT |
| Object Detection | NEU-DET | mAP | 1.8 | DE-ViT(w/o FT) |
| Object Detection | DIOR | mAP | 25.6 | DE-ViT-FT |
| Object Detection | DIOR | mAP | 8.4 | DE-ViT(w/o FT) |
| Object Detection | Clipart1k | mAP | 40.8 | DE-ViT-FT |
| Object Detection | Clipart1k | mAP | 11 | DE-ViT(w/o FT) |
| Object Detection | DeepFish | mAP | 21.3 | DE-ViT-FT |
| Object Detection | DeepFish | mAP | 2.1 | DE-ViT(w/o FT) |
| Object Detection | UODD | mAP | 5.4 | DE-ViT-FT |
| Object Detection | UODD | mAP | 3.1 | DE-ViT(w/o FT) |
| Object Detection | LVIS v1.0 | AP novel-LVIS base training | 34.3 | DE-ViT |
| Object Detection | MSCOCO | AP 0.5 | 50 | DE-ViT |
| Object Detection | COCO (Common Objects in Context) | AP 0.5 | 28.4 | DE-ViT |
| Few-Shot Object Detection | MS-COCO (30-shot) | AP | 34 | DE-ViT |
| Few-Shot Object Detection | MS-COCO (10-shot) | AP | 34 | DE-ViT |
| Few-Shot Object Detection | Artaxor | mAP | 49.2 | DE-ViT-FT |
| Few-Shot Object Detection | Artaxor | mAP | 9.2 | DE-ViT(w/o FT) |
| Few-Shot Object Detection | NEU-DET | mAP | 8.8 | DE-ViT-FT |
| Few-Shot Object Detection | NEU-DET | mAP | 1.8 | DE-ViT(w/o FT) |
| Few-Shot Object Detection | DIOR | mAP | 25.6 | DE-ViT-FT |
| Few-Shot Object Detection | DIOR | mAP | 8.4 | DE-ViT(w/o FT) |
| Few-Shot Object Detection | Clipart1k | mAP | 40.8 | DE-ViT-FT |
| Few-Shot Object Detection | Clipart1k | mAP | 11 | DE-ViT(w/o FT) |
| Few-Shot Object Detection | DeepFish | mAP | 21.3 | DE-ViT-FT |
| Few-Shot Object Detection | DeepFish | mAP | 2.1 | DE-ViT(w/o FT) |
| Few-Shot Object Detection | UODD | mAP | 5.4 | DE-ViT-FT |
| Few-Shot Object Detection | UODD | mAP | 3.1 | DE-ViT(w/o FT) |
| Open Vocabulary Object Detection | LVIS v1.0 | AP novel-LVIS base training | 34.3 | DE-ViT |
| Open Vocabulary Object Detection | MSCOCO | AP 0.5 | 50 | DE-ViT |
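
The cross-domain rows list each dataset twice, once with finetuning (DE-ViT-FT) and once without (DE-ViT(w/o FT)), so the effect of finetuning can be read off as a per-dataset difference. A small sketch of that arithmetic, using only the mAP values listed in the table; the variable and function names are illustrative:

```python
# (dataset, mAP with finetuning, mAP without finetuning), copied from the table.
CROSS_DOMAIN_RESULTS = [
    ("Artaxor", 49.2, 9.2),
    ("NEU-DET", 8.8, 1.8),
    ("DIOR", 25.6, 8.4),
    ("Clipart1k", 40.8, 11.0),
    ("DeepFish", 21.3, 2.1),
    ("UODD", 5.4, 3.1),
]


def finetuning_gains(results):
    """Return the mAP gained by finetuning for each dataset, rounded to 0.1."""
    return {name: round(ft - no_ft, 1) for name, ft, no_ft in results}


gains = finetuning_gains(CROSS_DOMAIN_RESULTS)
mean_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f} mAP")
print(f"mean gain: +{mean_gain:.2f} mAP")
```

The differences range from 2.3 mAP on UODD to 40.0 mAP on Artaxor.
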