Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun
This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the prompt transfers strongly and can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.
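The zero-shot transfer described above follows the standard CLIP-style recipe: each class name is wrapped with the (here, pre-trained) prompt, encoded into a text embedding, and the image is assigned to the class whose embedding is most cosine-similar to the image embedding. A minimal sketch with toy NumPy vectors — the embeddings, dimensions, and `zero_shot_classify` helper are illustrative stand-ins, not POMP's actual code, which uses CLIP's image and text encoders:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project features onto the unit sphere so dot products are cosines.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_feat, class_text_feats):
    # Cosine similarity between the image embedding and each
    # prompt-conditioned class text embedding; highest score wins.
    scores = l2_normalize(class_text_feats) @ l2_normalize(image_feat)
    return int(np.argmax(scores)), scores

# Toy stand-in embeddings (a real pipeline would encode
# "[learned prompt] + class name" with CLIP's text encoder).
class_text_feats = np.eye(3, 8)                       # 3 classes, 8-dim features
image_feat = class_text_feats[1] + 0.05 * np.ones(8)  # close to class 1

pred, scores = zero_shot_classify(image_feat, class_text_feats)
# pred == 1: the image is matched to the second class
```

Because only the shared prompt is learned, the same classifier construction extends to segmentation and detection by scoring region or pixel features against the same prompt-conditioned class embeddings.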
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Prompt Engineering | ImageNet-R | Top-1 accuracy % | 77.9 | POMP |
| Prompt Engineering | ImageNet-21k | Accuracy | 25.3 | POMP |
| Prompt Engineering | ImageNet-S | Top-1 accuracy % | 49.8 | POMP |
| Prompt Engineering | ImageNet-A | Top-1 accuracy % | 51.6 | POMP |
| Open Vocabulary Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 25.2 | POMP |
| Open Vocabulary Semantic Segmentation | COCO-Stuff-171 | hIoU | 39.1 | POMP |
| Open Vocabulary Semantic Segmentation | PascalVOC-20 | hIoU | 84.4 | POMP |
| Open Vocabulary Semantic Segmentation | PascalVOC-20 | mIoU | 89.4 | POMP |