Maria A. Bravo, Sudhanshu Mittal, Thomas Brox
In this work, we propose an open-vocabulary object detection method that learns from image-caption pairs to detect novel object classes along with a given set of known classes. It is a two-stage training approach: the first stage uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner, and the second stage specializes the model for the object detection task using known class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-vocabulary detection approaches while being data-efficient. Source code is available at https://github.com/lmb-freiburg/locov.
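To illustrate the general idea behind open-vocabulary detection described in the abstract above, the following is a minimal sketch of how a region feature could be scored against text embeddings of both known and novel class names. It is not the authors' implementation: the function names `cosine` and `classify_region`, the `temperature` value, and the 3-dimensional toy embeddings are all assumptions made for illustration. In a real system the embeddings would come from a language model and the region feature from a detector backbone.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_region(region_feature, class_embeddings, temperature=0.1):
    """Softmax over cosine similarities between one region and each class name.

    `temperature` is a hypothetical scaling constant, not a value from the paper.
    """
    names = list(class_embeddings)
    logits = [cosine(region_feature, class_embeddings[n]) / temperature
              for n in names]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}

# Toy embeddings (assumed values): two "known" classes and one "novel" class.
class_embeddings = {
    "dog": [1.0, 0.0, 0.0],       # known class, has box annotations
    "cat": [0.0, 1.0, 0.0],       # known class, has box annotations
    "umbrella": [0.0, 0.0, 1.0],  # novel class, seen only in captions
}

# A hypothetical region feature already projected into the text embedding space.
region_feature = [0.1, 0.2, 0.9]

probs = classify_region(region_feature, class_embeddings)
best = max(probs, key=probs.get)
print(best, probs)
```

Because classification reduces to a similarity lookup against class-name embeddings, adding a novel class at test time only requires adding its embedding to the dictionary, with no change to the scoring function.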
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | MSCOCO | AP 0.5 | 28.6 | LocOv (RN50-C4) |
| Object Detection | OVAD benchmark | mean average precision | 14.9 | LocOv (ResNet50) |
| Open Vocabulary Object Detection | MSCOCO | AP 0.5 | 28.6 | LocOv (RN50-C4) |
| Open Vocabulary Object Detection | OVAD benchmark | mean average precision | 14.9 | LocOv (ResNet50) |