YOLO-World: Real-Time Open-Vocabulary Object Detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

2024-01-30 | CVPR 2024

Tasks: Zero-Shot Object Detection, Semantic Segmentation, Open Vocabulary Object Detection, Instance Segmentation, Object Detection, Language Modelling

Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
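The abstract names a region-text contrastive loss but does not spell out its form. The following is a minimal sketch of one common way such a loss can be written, not the authors' implementation. It assumes that each region embedding has already been assigned the index of its matching text embedding, that similarity is the cosine similarity scaled by a temperature, and that the loss is a cross-entropy over the vocabulary. The function names, the `temperature` parameter, and its default value are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def region_text_contrastive_loss(region_embs, text_embs, targets, temperature=0.1):
    """Mean cross-entropy between region-text similarities and assigned texts.

    region_embs: list of region embedding vectors (hypothetical detector output)
    text_embs:   list of text embedding vectors, one per vocabulary entry
    targets:     targets[i] is the index of the text assigned to region i
    temperature: assumed scaling factor applied to the cosine similarities
    """
    texts = [l2_normalize(t) for t in text_embs]
    total = 0.0
    for region, target in zip(region_embs, targets):
        r = l2_normalize(region)
        # Cosine similarity of this region to every text, scaled by temperature.
        logits = [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[target]
    return total / len(region_embs)


if __name__ == "__main__":
    texts = [[1.0, 0.0], [0.0, 1.0]]      # toy embeddings for two vocabulary entries
    regions = [[0.9, 0.1], [0.2, 0.8]]    # toy embeddings for two regions
    print(region_text_contrastive_loss(regions, texts, targets=[0, 1]))
```

Under these assumptions, the loss is small when each region embedding points in the same direction as its assigned text embedding and grows as the two diverge. Because the vocabulary is just a list of text embeddings, it can be swapped at inference time without changing the region side, which is the property that open-vocabulary detection relies on.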

Results

Task	Dataset	Metric	Value	Model
Object Detection	LVIS v1.0 minival	AP	35.4	YOLO-World-L
Object Detection	MSCOCO	AP	45.1	YOLO-World-L (without COCO data)
