

YOLO-World: Real-Time Open-Vocabulary Object Detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

Published: 2024-01-30
Venue: CVPR 2024
Tasks: Zero-Shot Object Detection, Semantic Segmentation, Open Vocabulary Object Detection, Instance Segmentation, Object Detection, Language Modelling

Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
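The abstract names a region-text contrastive loss but does not spell out its form. The following is a minimal sketch of one common formulation, assuming cosine similarity between region and text embeddings and a softmax cross-entropy over the vocabulary. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper or its code.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def region_text_similarity(regions, texts):
    # Cosine similarity between every region embedding and every text embedding.
    r = [l2_normalize(v) for v in regions]
    t = [l2_normalize(v) for v in texts]
    return [[sum(a * b for a, b in zip(rv, tv)) for tv in t] for rv in r]

def region_text_contrastive_loss(regions, texts, assigned, temperature=0.07):
    # Assumed formulation: cross-entropy between the softmax over text
    # similarities and the index of the text assigned to each region.
    # The temperature of 0.07 is an illustrative choice.
    sims = region_text_similarity(regions, texts)
    total = 0.0
    for row, target in zip(sims, assigned):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(sims)

def zero_shot_labels(regions, texts, vocabulary):
    # Open-vocabulary prediction: pick the most similar text for each region.
    sims = region_text_similarity(regions, texts)
    return [vocabulary[max(range(len(row)), key=row.__getitem__)] for row in sims]

# Toy data (hypothetical): two vocabulary entries and two region embeddings.
vocabulary = ["dog", "bicycle"]
texts = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.9, 0.1], [0.2, 0.8]]

labels = zero_shot_labels(regions, texts, vocabulary)
loss_matched = region_text_contrastive_loss(regions, texts, [0, 1])
loss_mismatched = region_text_contrastive_loss(regions, texts, [1, 0])
```

In this sketch the vocabulary is just a list of text embeddings, so changing the detectable categories means swapping the `texts` list without touching the region features. That is the general idea behind open-vocabulary detection, though the actual model's architecture and training procedure are more involved than this illustration.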

Results

Task                | Dataset           | Metric | Value | Model
Object Detection    | LVIS v1.0 minival | AP     | 35.4  | YOLO-World-L
Object Detection    | MSCOCO            | AP     | 45.1  | YOLO-World-L (without COCO data)
3D                  | LVIS v1.0 minival | AP     | 35.4  | YOLO-World-L
3D                  | MSCOCO            | AP     | 45.1  | YOLO-World-L (without COCO data)
2D Classification   | LVIS v1.0 minival | AP     | 35.4  | YOLO-World-L
2D Classification   | MSCOCO            | AP     | 45.1  | YOLO-World-L (without COCO data)
2D Object Detection | LVIS v1.0 minival | AP     | 35.4  | YOLO-World-L
2D Object Detection | MSCOCO            | AP     | 45.1  | YOLO-World-L (without COCO data)
16k                 | LVIS v1.0 minival | AP     | 35.4  | YOLO-World-L
16k                 | MSCOCO            | AP     | 45.1  | YOLO-World-L (without COCO data)
