Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

CLIP

Reported on 177 benchmarks across 15 tasks · 8 papers · 79 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
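The exact-name matching described in the note can be sketched as follows. This is a minimal illustration, not the site's actual code: the row layout and the function name `group_by_exact_name` are assumptions, and the `CLIP-ViT-B/32` row with its value is hypothetical. The two `CLIP` rows reuse values listed on this page from two different papers.

```python
from collections import defaultdict

# Rows are (model name, benchmark, metric, value, source).
# The first two values appear on this page; the third row is hypothetical.
rows = [
    ("CLIP", "ObjectNet", "Top-1 Accuracy", 72.3, "arXiv:2103.00020"),
    ("CLIP", "MSR-VTT", "text-to-video R@1", 21.4, "arXiv:2102.12443"),
    ("CLIP-ViT-B/32", "ObjectNet", "Top-1 Accuracy", 50.0, "hypothetical"),
]

def group_by_exact_name(rows):
    """Group result rows by the model name string, compared exactly."""
    grouped = defaultdict(list)
    for name, benchmark, metric, value, source in rows:
        grouped[name].append((benchmark, metric, value, source))
    return dict(grouped)

grouped = group_by_exact_name(rows)
# Both "CLIP" rows land in one group even though they come from different
# papers, which may have evaluated different model variants.
print(sorted(grouped))
print(len(grouped["CLIP"]))
```

Under this scheme a differently spelled variant name forms its own group, while two papers that both write "CLIP" are merged regardless of which checkpoint they used.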

Computer Vision · 120 results

  • Video on MAD
    R@1,IoU=0.1 · 2021-12-01
    6.57
    best: 17.3 (ReVisionLLM)
    SOTA
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@1,IoU=0.3 · 2021-12-01
    3.13
    best: 12.7 (ReVisionLLM)
    SOTA
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@10,IoU=0.1 · 2021-12-01
    20.26
    best: 41.44 (DenoiseLoc)
    SOTA
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@5,IoU=0.1 · 2021-12-01
    15.05
    best: 30.35 (DenoiseLoc)
    SOTA
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@5,IoU=0.3 · 2021-12-01
    9.85
    best: 23.68 (DeCafNet)
    SOTA
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Image Classification on ObjectNet
    Top-1 Accuracy · uses extra data · 2021-02-26
    72.3
    best: 82.7 (CoCa)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Zero-Shot Transfer Image Classification on ImageNet-A
    Accuracy (Private) · 2021-02-26
    77.2
    best: 90.2 (CoCa)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Zero-Shot Transfer Image Classification on ImageNet
    Accuracy (Public) · 2021-02-26
    31.3
    best: 76.5 (CWCL)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Zero-Shot Transfer Image Classification on SUN
    Accuracy · 2021-02-26
    58.5
    best: 77.7 (EVA-CLIP-18B)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Zero-Shot Transfer Image Classification on ObjectNet
    Accuracy (Private) · 2021-02-26
    72.3
    best: 87.6 (LiT-22B)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Zero-Shot Transfer Image Classification on aYahoo
    Accuracy · 2021-02-26
    98.4
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Object Categorization on GRIT
    Categorization (ablation) · 2021-02-26
    48.1
    best: 61.7 (Unified-IOXL)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Video on MSR-VTT-1kA
    text-to-video R@1 · uses extra data · 2021-02-24
    31.2
    best: 62.9 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT-1kA
    video-to-text Median Rank · uses extra data · 2021-02-24
    5
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT-1kA
    video-to-text R@1 · uses extra data · 2021-02-24
    27.2
    best: 64.8 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT-1kA
    video-to-text R@10 · uses extra data · 2021-02-24
    62.6
    best: 91.1 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT-1kA
    video-to-text R@5 · uses extra data · 2021-02-24
    51.7
    best: 783 (PIDRo)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT
    text-to-video R@1 · 2021-02-24
    21.4
    best: 64 (GRAM)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT
    video-to-text R@1 · 2021-02-24
    40.3
    best: 64.8 (GRAM)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT
    video-to-text R@10 · 2021-02-24
    79.2
    best: 92.8 (CAMoE)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT
    video-to-text R@5 · 2021-02-24
    69.7
    best: 86.2 (CAMoE)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on LSMDC
    text-to-video Median Rank · 2021-02-24
    56.5
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on LSMDC
    video-to-text Median Rank · 2021-02-24
    73
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on LSMDC
    video-to-text R@1 · 2021-02-24
    6.8
    best: 46.7 (InternVideo2-6B)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on LSMDC
    video-to-text R@10 · 2021-02-24
    22.1
    best: 91.8 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on LSMDC
    video-to-text R@5 · 2021-02-24
    16.4
    best: 71.8 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSVD
    text-to-video R@1 · 2021-02-24
    37
    best: 61.4 (InternVideo2-6B)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSVD
    text-to-video R@10 · 2021-02-24
    73.8
    best: 90.3 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSVD
    text-to-video R@5 · 2021-02-24
    64.1
    best: 87.6 (CAMoE)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSVD
    video-to-text Median Rank · 2021-02-24
    1
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSVD
    video-to-text R@1 · 2021-02-24
    59.9
    best: 85.2 (InternVideo2-6B)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSVD
    video-to-text R@10 · 2021-02-24
    90.7
    best: 97.1 (PAU)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSVD
    video-to-text R@5 · 2021-02-24
    85.2
    best: 94.5 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Image Retrieval on ConQA Conceptual
    R-precision · 2021-02-24
    6.8
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT-1kA
    text-to-video R@1 · uses extra data · 2021-02-24
    31.2
    best: 62.9 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT-1kA
    video-to-text Median Rank · uses extra data · 2021-02-24
    5
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT-1kA
    video-to-text R@1 · uses extra data · 2021-02-24
    27.2
    best: 64.8 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT-1kA
    video-to-text R@10 · uses extra data · 2021-02-24
    62.6
    best: 91.1 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT-1kA
    video-to-text R@5 · uses extra data · 2021-02-24
    51.7
    best: 783 (PIDRo)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT
    text-to-video R@1 · 2021-02-24
    21.4
    best: 64 (GRAM)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT
    video-to-text R@1 · 2021-02-24
    40.3
    best: 64.8 (GRAM)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT
    video-to-text R@10 · 2021-02-24
    79.2
    best: 92.8 (CAMoE)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT
    video-to-text R@5 · 2021-02-24
    69.7
    best: 86.2 (CAMoE)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on LSMDC
    text-to-video Median Rank · 2021-02-24
    56.5
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on LSMDC
    video-to-text Median Rank · 2021-02-24
    73
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on LSMDC
    video-to-text R@1 · 2021-02-24
    6.8
    best: 46.7 (InternVideo2-6B)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on LSMDC
    video-to-text R@10 · 2021-02-24
    22.1
    best: 91.8 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on LSMDC
    video-to-text R@5 · 2021-02-24
    16.4
    best: 71.8 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSVD
    text-to-video R@1 · 2021-02-24
    37
    best: 61.4 (InternVideo2-6B)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSVD
    text-to-video R@10 · 2021-02-24
    73.8
    best: 90.3 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSVD
    text-to-video R@5 · 2021-02-24
    64.1
    best: 87.6 (CAMoE)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSVD
    video-to-text Median Rank · 2021-02-24
    1
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSVD
    video-to-text R@1 · 2021-02-24
    59.9
    best: 85.2 (InternVideo2-6B)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSVD
    video-to-text R@10 · 2021-02-24
    90.7
    best: 97.1 (PAU)
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSVD
    video-to-text R@5 · 2021-02-24
    85.2
    best: 94.5 (HunYuan_tvr (huge))
    SOTA
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on ActivityNet Adverbs
    Acc-A · 2023-09-26
    55.1
    best: 58.4 (ReGaDa)
    Video-adverb retrieval with compositional adverb-action embeddings arXiv:2309.15086
  • Video on MSR-VTT Adverbs
    Acc-A · 2023-09-26
    57
    best: 61 (ReGaDa)
    Video-adverb retrieval with compositional adverb-action embeddings arXiv:2309.15086
  • Video on VATEX Adverbs
    Acc-A · 2023-09-26
    54.5
    best: 61.7 (ReGaDa)
    Video-adverb retrieval with compositional adverb-action embeddings arXiv:2309.15086
  • Video Retrieval on ActivityNet Adverbs
    Acc-A · 2023-09-26
    55.1
    best: 58.4 (ReGaDa)
    Video-adverb retrieval with compositional adverb-action embeddings arXiv:2309.15086
  • Video Retrieval on MSR-VTT Adverbs
    Acc-A · 2023-09-26
    57
    best: 61 (ReGaDa)
    Video-adverb retrieval with compositional adverb-action embeddings arXiv:2309.15086
  • Video Retrieval on VATEX Adverbs
    Acc-A · 2023-09-26
    54.5
    best: 61.7 (ReGaDa)
    Video-adverb retrieval with compositional adverb-action embeddings arXiv:2309.15086
  • Video-Adverb Retrieval on ActivityNet Adverbs
    Acc-A · 2023-09-26
    55.1
    best: 58.4 (ReGaDa)
    Video-adverb retrieval with compositional adverb-action embeddings arXiv:2309.15086
  • Video-Adverb Retrieval on MSR-VTT Adverbs
    Acc-A · 2023-09-26
    57
    best: 61 (ReGaDa)
    Video-adverb retrieval with compositional adverb-action embeddings arXiv:2309.15086
  • Video-Adverb Retrieval on VATEX Adverbs
    Acc-A · 2023-09-26
    54.5
    best: 61.7 (ReGaDa)
    Video-adverb retrieval with compositional adverb-action embeddings arXiv:2309.15086
  • Visual Place Recognition on Nardo-Air R
    Recall@1 · 2023-08-01
    61.97
    best: 94.37 (AnyLoc-VLAD-DINO)
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on Oxford RobotCar Dataset
    Recall@1 · 2023-08-01
    34.55
    best: 98.95 (AnyLoc-VLAD-DINOv2)
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on Nardo-Air
    Recall@1 · 2023-08-01
    42.25
    best: 76.06 (AnyLoc-VLAD-DINOv2)
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on Mid-Atlantic Ridge
    Recall@1 · 2023-08-01
    25.74
    best: 34.65 (AnyLoc-VLAD-DINOv2)
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on St Lucia
    Recall@1 · 2023-08-01
    62.7
    best: 100 (EffoVPR)
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on Hawkins
    Recall@1 · 2023-08-01
    33.05
    best: 65.25 (AnyLoc-VLAD-DINOv2)
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on Laurel Caverns
    Recall@1 · 2023-08-01
    36.61
    best: 61.61 (AnyLoc-VLAD-DINOv2)
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on Gardens Point
    Recall@1 · 2023-08-01
    42.5
    best: 95.5 (AnyLoc-VLAD-DINOv2)
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on Pittsburgh-30k-test
    Recall@1 · 2023-08-01
    54.97
    best: 95.4 (Pair-VPR-p)
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on VP-Air
    Recall@1 · 2023-08-01
    36.59
    best: 66.74 (AnyLoc-VLAD-DINOv2)
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on 17 Places
    Recall@1 · 2023-08-01
    59.36
    best: 95.3 (SegVLAD-FineT (M))
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Visual Place Recognition on Baidu Mall
    Recall@1 · 2023-08-01
    56.02
    best: 80.4 (SegVLAD-PreT (M))
    AnyLoc: Towards Universal Visual Place Recognition arXiv:2308.00688
  • Image Retrieval on MSCOCO
    Recall@1 · 2023-01-11
    37.02
    best: 58.46 (HADA)
    HADA: A Graph-based Amalgamation Framework in Image-text Retrieval arXiv:2301.04742
  • Image Retrieval on MSCOCO
    Recall@10 · 2023-01-11
    71.5
    best: 89.66 (HADA)
    HADA: A Graph-based Amalgamation Framework in Image-text Retrieval arXiv:2301.04742
  • Image Retrieval on MSCOCO
    Recall@5 · 2023-01-11
    61.66
    best: 82.85 (HADA)
    HADA: A Graph-based Amalgamation Framework in Image-text Retrieval arXiv:2301.04742
  • Video on MAD
    R@1,IoU=0.5 · 2021-12-01
    1.39
    best: 7.06 (DeCafNet)
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@10,IoU=0.3 · 2021-12-01
    14.13
    best: 19.86 (VLG-Net + Guidance Model)
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@10,IoU=0.5 · 2021-12-01
    8.38
    best: 13.72 (VLG-Net + Guidance Model)
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@100,IoU=0.1 · 2021-12-01
    47.73
    best: 73.62 (DenoiseLoc)
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@100,IoU=0.3 · 2021-12-01
    36.98
    best: 49.38 (VLG-Net + Guidance Model)
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@100,IoU=0.5 · 2021-12-01
    24.99
    best: 39.12 (VLG-Net + Guidance Model)
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@5,IoU=0.5 · 2021-12-01
    5.44
    best: 16.13 (DeCafNet)
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@50,IoU=0.1 · 2021-12-01
    37.92
    best: 66.07 (DenoiseLoc)
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@50,IoU=0.3 · 2021-12-01
    28.71
    best: 39.77 (VLG-Net + Guidance Model)
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Video on MAD
    R@50,IoU=0.5 · 2021-12-01
    18.8
    best: 30.22 (VLG-Net + Guidance Model)
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions arXiv:2112.00431
  • Zero-Shot Transfer Image Classification on ImageNet V2
    Accuracy (Private) · 2021-02-26
    70.1
    best: 81.2 (BASIC (Lion))
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Zero-Shot Transfer Image Classification on ImageNet-R
    Accuracy · 2021-02-26
    88.9
    best: 96.8 (BASIC (Lion))
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Video on MSR-VTT-1kA
    text-to-video Median Rank · uses extra data · 2021-02-24
    4
    best: 13 (JSFusion)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT-1kA
    text-to-video R@10 · uses extra data · 2021-02-24
    64.2
    best: 90.8 (HunYuan_tvr (huge))
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT-1kA
    text-to-video R@5 · uses extra data · 2021-02-24
    53.7
    best: 84.5 (HunYuan_tvr (huge))
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT
    text-to-video Median Rank · 2021-02-24
    10
    best: 55 (C+LSTM+SA+FC7)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT
    text-to-video R@10 · 2021-02-24
    50.4
    best: 89.6 (VAST)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT
    text-to-video R@5 · 2021-02-24
    41.1
    best: 84.3 (VAST)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSR-VTT
    video-to-text Median Rank · 2021-02-24
    2
    best: 16 (JEMC)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on LSMDC
    text-to-video R@1 · 2021-02-24
    11.3
    best: 46.4 (InternVideo2-6B)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on LSMDC
    text-to-video R@10 · 2021-02-24
    29.2
    best: 92.8 (HunYuan_tvr (huge))
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on LSMDC
    text-to-video R@5 · 2021-02-24
    22.7
    best: 80.1 (HunYuan_tvr (huge))
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video on MSVD
    text-to-video Median Rank · 2021-02-24
    3
    best: 6 (SSML)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT-1kA
    text-to-video Median Rank · uses extra data · 2021-02-24
    4
    best: 13 (JSFusion)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT-1kA
    text-to-video R@10 · uses extra data · 2021-02-24
    64.2
    best: 90.8 (HunYuan_tvr (huge))
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT-1kA
    text-to-video R@5 · uses extra data · 2021-02-24
    53.7
    best: 84.5 (HunYuan_tvr (huge))
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT
    text-to-video Median Rank · 2021-02-24
    10
    best: 55 (C+LSTM+SA+FC7)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT
    text-to-video R@10 · 2021-02-24
    50.4
    best: 89.6 (VAST)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT
    text-to-video R@5 · 2021-02-24
    41.1
    best: 84.3 (VAST)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSR-VTT
    video-to-text Median Rank · 2021-02-24
    2
    best: 16 (JEMC)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on LSMDC
    text-to-video R@1 · 2021-02-24
    11.3
    best: 46.4 (InternVideo2-6B)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on LSMDC
    text-to-video R@10 · 2021-02-24
    29.2
    best: 92.8 (HunYuan_tvr (huge))
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on LSMDC
    text-to-video R@5 · 2021-02-24
    22.7
    best: 80.1 (HunYuan_tvr (huge))
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Video Retrieval on MSVD
    text-to-video Median Rank · 2021-02-24
    3
    best: 6 (SSML)
    A Straightforward Framework For Video Retrieval Using CLIP arXiv:2102.12443
  • Image Retrieval on ConQA Conceptual
    Recall@1
    12.2
  • Image Retrieval on ConQA Conceptual
    Recall@10
    36.7
    best: 40.8 (BLIP)
  • Image Retrieval on ConQA Conceptual
    Recall@5
    30.6
  • Image Retrieval on ConQA Descriptive
    R-precision
    16.5
  • Image Retrieval on ConQA Descriptive
    Recall@1
    20.7
  • Image Retrieval on ConQA Descriptive
    Recall@10
    65.5
  • Image Retrieval on ConQA Descriptive
    Recall@5
    58.3
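Most rows in the list above report R@K (recall at K), median rank, or R@K at a temporal IoU threshold. The sketch below shows how these metrics are commonly computed. It assumes 1-based ranks of the correct item per query and moments given as (start, end) intervals in seconds; all function names and all example numbers are hypothetical and are not taken from any of the listed papers.

```python
from statistics import median

def recall_at_k(ranks, k):
    """Percentage of queries whose correct item is ranked within the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    """Median position of the correct item; lower is better."""
    return median(ranks)

def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k_iou(queries, k, threshold):
    """Percentage of queries with at least one top-k moment at IoU >= threshold.

    queries: list of (ranked_predictions, ground_truth_interval).
    """
    hits = sum(
        any(temporal_iou(p, gt) >= threshold for p in preds[:k])
        for preds, gt in queries
    )
    return 100.0 * hits / len(queries)

# Hypothetical ranks of the correct video for eight text queries.
ranks = [1, 3, 12, 2, 7, 1, 40, 5]
print(recall_at_k(ranks, 1), recall_at_k(ranks, 5), recall_at_k(ranks, 10))
print(median_rank(ranks))

# Hypothetical grounding queries: ranked predicted moments and the true moment.
queries = [
    ([(10, 20), (30, 40)], (12, 22)),
    ([(0, 5), (50, 60)], (52, 58)),
]
print(recall_at_k_iou(queries, 1, 0.5), recall_at_k_iou(queries, 2, 0.5))
```

Recall values are higher-is-better, whereas median rank is lower-is-better, which matters when comparing a row's value against its "best" entry.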

Natural Language Processing · 30 results

  • Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
    pairwise accuracy · 2021-12-14
    88.8
    SOTA
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE action replacement
    pairwise accuracy · 2021-12-14
    75.6
    SOTA
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Meme Classification on MultiOFF
    Accuracy · 2021-02-26
    62.4
    best: 71.1 (RA-HMD (Qwen2-VL-7B))
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Meme Classification on MultiOFF
    F1 · 2021-02-26
    48.1
    best: 64.8 (RA-HMD (Qwen2-VL-7B))
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Meme Classification on Harm-P
    Accuracy · 2021-02-26
    80.6
    best: 91.6 (RA-HMD (Qwen2-VL-7B))
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Meme Classification on Harm-P
    F1 · 2021-02-26
    80.3
    best: 91.1 (RA-HMD (Qwen2-VL-7B))
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on ImageNet-R
    Top-1 accuracy % · 2021-02-26
    73.96
    best: 77.9 (POMP)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on Stanford Cars
    Harmonic mean · 2021-02-26
    68.65
    best: 83.13 (PromptKD)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on Oxford 102 Flower
    Harmonic mean · 2021-02-26
    74.83
    best: 90.24 (PromptKD)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on EuroSAT
    Harmonic mean · 2021-02-26
    60.03
    best: 91.94 (MMRL++)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on Oxford-IIIT Pet Dataset
    Harmonic mean · 2021-02-26
    94.12
    best: 97.15 (PromptKD)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on ImageNet-S
    Top-1 accuracy % · 2021-02-26
    46.15
    best: 49.8 (POMP)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on DTD
    Harmonic mean · 2021-02-26
    56.37
    best: 77.94 (PromptKD)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on UCF101
    Harmonic mean · 2021-02-26
    73.85
    best: 86.1 (PromptKD)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on Caltech-101
    Harmonic mean · 2021-02-26
    95.4
    best: 97.77 (PromptKD)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on ImageNet
    Harmonic mean · 2021-02-26
    70.22
    best: 77.62 (PromptKD)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on FGVC-Aircraft
    Harmonic mean · 2021-02-26
    31.09
    best: 45.17 (PromptKD)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on SUN397
    Harmonic mean · 2021-02-26
    72.23
    best: 82.6 (PromptKD)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on ImageNet-A
    Top-1 accuracy % · 2021-02-26
    47.77
    best: 51.6 (POMP)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Prompt Engineering on ImageNet V2
    Top-1 accuracy % · 2021-02-26
    60.83
    best: 65.31 (HPT++)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Multimodal Text and Image Classification on VALSE counting adversarial
    pairwise accuracy · 2021-12-14
    57.5
    best: 77.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting balanced
    pairwise accuracy · 2021-12-14
    62.1
    best: 76.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE actant swap
    pairwise accuracy · 2021-12-14
    68.6
    best: 76.9 (GPT2)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE coreference clean
    pairwise accuracy · 2021-12-14
    49.7
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting small numbers
    pairwise accuracy · 2021-12-14
    62.5
    best: 80.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE existence
    pairwise accuracy · 2021-12-14
    66.9
    best: 95.6 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE coreference standard
    pairwise accuracy · 2021-12-14
    52.1
    best: 75.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE spatial relations
    pairwise accuracy · 2021-12-14
    64.3
    best: 77.2 (GPT1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE plurality
    pairwise accuracy · 2021-12-14
    56.2
    best: 72.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE
    average pairwise accuracy · 2021-12-14
    64
    best: 75.1 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
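Several Prompt Engineering rows in this section report a harmonic mean. The listing does not say which two quantities are combined; the sketch below assumes the convention common in base-to-novel prompt-learning evaluations, where the harmonic mean is taken over base-class and novel-class accuracy. The function name and the example accuracies are hypothetical.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of two accuracies given in percent."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / total

# Hypothetical accuracies, not taken from any row above.
base_acc, novel_acc = 80.0, 60.0
hm = harmonic_mean(base_acc, novel_acc)
print(round(hm, 2))
```

The harmonic mean is pulled toward the smaller of the two values, so a model that does well on base classes but poorly on novel classes scores lower than it would under an arithmetic mean.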

Methodology · 12 results

  • Multimodal Deep Learning on VALSE foil-it (noun phrases)
    pairwise accuracy · 2021-12-14
    88.8
    SOTA
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE action replacement
    pairwise accuracy · 2021-12-14
    75.6
    SOTA
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting adversarial
    pairwise accuracy · 2021-12-14
    57.5
    best: 77.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting balanced
    pairwise accuracy · 2021-12-14
    62.1
    best: 76.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE actant swap
    pairwise accuracy · 2021-12-14
    68.6
    best: 76.9 (GPT2)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE coreference clean
    pairwise accuracy · 2021-12-14
    49.7
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting small numbers
    pairwise accuracy · 2021-12-14
    62.5
    best: 80.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE existence
    pairwise accuracy · 2021-12-14
    66.9
    best: 95.6 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE coreference standard
    pairwise accuracy · 2021-12-14
    52.1
    best: 75.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE spatial relations
    pairwise accuracy · 2021-12-14
    64.3
    best: 77.2 (GPT1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE plurality
    pairwise accuracy · 2021-12-14
    56.2
    best: 72.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
  • Multimodal Deep Learning on VALSE
    average pairwise accuracy · 2021-12-14
    64
    best: 75.1 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
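The pairwise accuracy reported in the rows above can be sketched as follows, assuming the usual caption-versus-foil setup: for each image the model scores a correct caption and a minimally altered foil, and a pair counts as correct when the caption scores strictly higher. The function name and the scores are hypothetical, and counting ties as incorrect is an assumption of this sketch.

```python
def pairwise_accuracy(score_pairs):
    """Percentage of (caption_score, foil_score) pairs where the caption wins.

    Ties are counted as incorrect (an assumption of this sketch).
    """
    correct = sum(caption > foil for caption, foil in score_pairs)
    return 100.0 * correct / len(score_pairs)

# Hypothetical image-text similarity scores for four caption/foil pairs.
score_pairs = [
    (0.31, 0.25),  # caption preferred
    (0.22, 0.27),  # foil preferred
    (0.40, 0.18),  # caption preferred
    (0.29, 0.29),  # tie, counted as incorrect
]
print(pairwise_accuracy(score_pairs))
```

A model that scores captions and foils at random would sit near 50 on this metric, which gives a reference point for reading the values in the list.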

Miscellaneous · 12 results

  • Image Retrieval with Multi-Modal Query on Flickr30k
    Image-to-text R@1 · 2021-02-26
    88
    best: 98.8 (X2-VLM (large))
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on Flickr30k
    Image-to-text R@10 · 2021-02-26
    99.4
    best: 100 (X2-VLM (large))
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on Flickr30k
    Image-to-text R@5 · 2021-02-26
    98.7
    best: 100 (X2-VLM (large))
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on Flickr30k
    Text-to-image R@1 · 2021-02-26
    68.7
    best: 93.3 (ERNIE-ViL 2.0)
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on Flickr30k
    Text-to-image R@10 · 2021-02-26
    95.2
    best: 99.8 (ERNIE-ViL 2.0)
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on Flickr30k
    Text-to-image R@5 · 2021-02-26
    90.6
    best: 99.5 (M2-Encoder)
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on COCO 2014
    Image-to-text R@1 · 2021-02-26
    58.4
    best: 84.8 (BEiT-3)
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on COCO 2014
    Image-to-text R@10 · 2021-02-26
    88.1
    best: 98.5 (X2-VLM (large))
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on COCO 2014
    Image-to-text R@5 · 2021-02-26
    81.5
    best: 96.5 (X2-VLM (large))
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on COCO 2014
    Text-to-image R@1 · 2021-02-26
    37.8
    best: 68 (VAST)
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on COCO 2014
    Text-to-image R@10 · 2021-02-26
    72.2
    best: 92.8 (VAST)
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Image Retrieval with Multi-Modal Query on COCO 2014
    Text-to-image R@5 · 2021-02-26
    62.4
    best: 92.8 (BEiT-3)
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020

Robots · 2 results

  • Activity Recognition on RareAct
    mWAP · 2021-02-26
    40.7
    best: 60.8 (🦩 Flamingo)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020
  • Activity Recognition on Stanford40
    Top-3 Accuracy (%) · 2024-03-11
    6.49
    best: 10.47 (FocusCLIP)
    Human Pose Descriptions and Subject-Focused Attention for Improved Zero-Shot Transfer in Human-Centric Classification Tasks arXiv:2403.06904

Time Series · 1 result

  • Action Recognition on RareAct
    mWAP · 2021-02-26
    40.7
    best: 60.8 (🦩 Flamingo)
    SOTA
    Learning Transferable Visual Models From Natural Language Supervision arXiv:2103.00020