Detect Everything with Few Examples

Xinyu Zhang, YuHan Liu, Yuting Wang, Abdeslam Boularias

2023-09-22

Few-Shot Object Detection, Binary Classification, Open Vocabulary Object Detection, Cross-Domain Few-Shot Object Detection, Object Detection, One-Shot Object Detection

Abstract

Few-shot object detection aims to detect novel categories given only a few example images. It is a basic skill for a robot performing tasks in open environments. Recent methods focus on finetuning strategies whose complicated procedures hinder wider application. In this paper, we introduce DE-ViT, a few-shot object detector that requires no finetuning. DE-ViT's novel architecture is based on a new region-propagation mechanism for localization. The propagated region masks are transformed into bounding boxes through a learnable spatial integral layer. Instead of training prototype classifiers, we propose to use prototypes to project ViT features into a subspace that is robust to overfitting on base classes. We evaluate DE-ViT on few-shot and one-shot object detection benchmarks with Pascal VOC, COCO, and LVIS. DE-ViT establishes new state-of-the-art results on all benchmarks. Notably, on COCO, DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. On LVIS, DE-ViT outperforms the few-shot SoTA by 17 box APr. Further, we evaluate DE-ViT with a real robot by building a pick-and-place system for sorting novel objects based on example images. The videos of our robot demonstrations, the source code, and the models of DE-ViT can be found at https://mlzxy.github.io/devit.
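
The idea of using prototypes to project features, rather than training a classifier on them, can be illustrated with a small sketch. This is not the DE-ViT implementation: the helper names (`mean_prototype`, `cosine`, `project_onto_prototypes`), the use of mean-pooled prototypes with cosine similarity, and the toy 3-dimensional vectors standing in for ViT features are all assumptions made for illustration.

```python
import math


def mean_prototype(features):
    """Average a list of equal-length feature vectors into one prototype."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def project_onto_prototypes(feature, prototypes):
    """Re-express a feature as its similarities to each class prototype.

    The output has one entry per prototype (in dict insertion order), so
    downstream layers see similarity scores instead of raw feature channels.
    """
    return [cosine(feature, proto) for proto in prototypes.values()]


# Toy support set: two example features per class (hypothetical values).
support = {
    "mug": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "box": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
prototypes = {name: mean_prototype(feats) for name, feats in support.items()}

# A query feature is projected without any gradient step on the new classes.
query = [0.9, 0.1, 0.0]
projected = project_onto_prototypes(query, prototypes)
best = max(zip(prototypes, projected), key=lambda pair: pair[1])[0]
```

In this toy setup, adding a new category only requires computing one more prototype from its example features; nothing is finetuned, which is the property the abstract emphasizes.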

Results

Task	Dataset	Metric	Value	Model
Object Detection	MS-COCO (30-shot)	AP	34	DE-ViT
Object Detection	MS-COCO (10-shot)	AP	34	DE-ViT
Object Detection	Artaxor	mAP	49.2	DE-ViT-FT
Object Detection	Artaxor	mAP	9.2	DE-ViT(w/o FT)
Object Detection	NEU-DET	mAP	8.8	DE-ViT-FT
Object Detection	NEU-DET	mAP	1.8	DE-ViT(w/o FT)
Object Detection	DIOR	mAP	25.6	DE-ViT-FT
Object Detection	DIOR	mAP	8.4	DE-ViT(w/o FT)
Object Detection	Clipark1k	mAP	40.8	DE-ViT-FT
Object Detection	Clipark1k	mAP	11	DE-ViT(w/o FT)
Object Detection	DeepFish	mAP	21.3	DE-ViT-FT
Object Detection	DeepFish	mAP	2.1	DE-ViT(w/o FT)
Object Detection	UODD	mAP	5.4	DE-ViT-FT
Object Detection	UODD	mAP	3.1	DE-ViT(w/o FT)
Object Detection	LVIS v1.0	AP novel-LVIS base training	34.3	DE-ViT
Object Detection	MSCOCO	AP 0.5	50	DE-ViT
Object Detection	COCO (Common Objects in Context)	AP 0.5	28.4	DE-ViT
Few-Shot Object Detection	MS-COCO (30-shot)	AP	34	DE-ViT
Few-Shot Object Detection	MS-COCO (10-shot)	AP	34	DE-ViT
Few-Shot Object Detection	Artaxor	mAP	49.2	DE-ViT-FT
Few-Shot Object Detection	Artaxor	mAP	9.2	DE-ViT(w/o FT)
Few-Shot Object Detection	NEU-DET	mAP	8.8	DE-ViT-FT
Few-Shot Object Detection	NEU-DET	mAP	1.8	DE-ViT(w/o FT)
Few-Shot Object Detection	DIOR	mAP	25.6	DE-ViT-FT
Few-Shot Object Detection	DIOR	mAP	8.4	DE-ViT(w/o FT)
Few-Shot Object Detection	Clipark1k	mAP	40.8	DE-ViT-FT
Few-Shot Object Detection	Clipark1k	mAP	11	DE-ViT(w/o FT)
Few-Shot Object Detection	DeepFish	mAP	21.3	DE-ViT-FT
Few-Shot Object Detection	DeepFish	mAP	2.1	DE-ViT(w/o FT)
Few-Shot Object Detection	UODD	mAP	5.4	DE-ViT-FT
Few-Shot Object Detection	UODD	mAP	3.1	DE-ViT(w/o FT)
Open Vocabulary Object Detection	LVIS v1.0	AP novel-LVIS base training	34.3	DE-ViT
Open Vocabulary Object Detection	MSCOCO	AP 0.5	50	DE-ViT
