Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

CLIPSelf

Reported on 16 benchmarks across 8 tasks · 1 paper

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Computer Vision · 8 results

Object Detection on LVIS v1.0
AP novel-LVIS base training · 2023-10-02
34.9
best: 43.4 (LaMI-DETR)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
Object Detection on MSCOCO
AP 0.5 · 2023-10-02
44.3
best: 50.3 (Cooperative Foundational Models)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
Open Vocabulary Panoptic Segmentation on ADE20K
PQ · 2023-10-02
23.7
best: 31.6 (UMG-CLIP-E/14)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
Open Vocabulary Object Detection on LVIS v1.0
AP novel-LVIS base training · 2023-10-02
34.9
best: 43.4 (LaMI-DETR)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
Open Vocabulary Object Detection on MSCOCO
AP 0.5 · 2023-10-02
44.3
best: 50.3 (Cooperative Foundational Models)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
Open Vocabulary Semantic Segmentation on ADE20K-847
mIoU · 2023-10-02
12.4
best: 17.3 (UMG-CLIP-E/14)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
Open Vocabulary Semantic Segmentation on PASCAL Context-59
mIoU · 2023-10-02
62.3
best: 64.6 (HyperSeg)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
Open Vocabulary Semantic Segmentation on ADE20K-150
mIoU · 2023-10-02
34.5
best: 38.2 (Mask-Adapter)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403

Methodology · 8 results

3D on LVIS v1.0
AP novel-LVIS base training · 2023-10-02
34.9
best: 43.4 (LaMI-DETR)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
3D on MSCOCO
AP 0.5 · 2023-10-02
44.3
best: 50.3 (Cooperative Foundational Models)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
2D Classification on LVIS v1.0
AP novel-LVIS base training · 2023-10-02
34.9
best: 43.4 (LaMI-DETR)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
2D Classification on MSCOCO
AP 0.5 · 2023-10-02
44.3
best: 50.3 (Cooperative Foundational Models)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
2D Object Detection on LVIS v1.0
AP novel-LVIS base training · 2023-10-02
34.9
best: 43.4 (LaMI-DETR)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
2D Object Detection on MSCOCO
AP 0.5 · 2023-10-02
44.3
best: 50.3 (Cooperative Foundational Models)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
16k on LVIS v1.0
AP novel-LVIS base training · 2023-10-02
34.9
best: 43.4 (LaMI-DETR)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403
16k on MSCOCO
AP 0.5 · 2023-10-02
44.3
best: 50.3 (Cooperative Foundational Models)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction arXiv:2310.01403