Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VinVL

Reported on 18 benchmarks across 5 tasks · 2 papers · 14 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Natural Language Processing · 13 results

  • Image Captioning on nocaps-val-out-domain
    CIDEr · 2021-01-02
    88.3
    best: 124.8 (BLIP-2 ViT-G FlanT5 XL (zero-shot))
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on nocaps-val-out-domain
    SPICE · 2021-01-02
    12.1
    best: 15.1 (BLIP-2 ViT-G FlanT5 XL (zero-shot))
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on nocaps-val-near-domain
    CIDEr · 2021-01-02
    96.1
    best: 120.2 (BLIP-2 ViT-G FlanT5 XL (zero-shot))
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on nocaps-val-near-domain
    SPICE · 2021-01-02
    13.8
    best: 15.9 (BLIP-2 ViT-G FlanT5 XL (zero-shot))
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on COCO Captions
    CIDEr · 2021-01-02
    140.9
    best: 155.1 (mPLUG)
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on COCO Captions
    METEOR · 2021-01-02
    31.1
    best: 33.9 (CoCa)
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on COCO Captions
    SPICE · 2021-01-02
    25.2
    best: 27 (VAST)
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on nocaps-val-overall
    CIDEr · 2021-01-02
    95.5
    best: 121.6 (BLIP-2 ViT-G FlanT5 XL (zero-shot))
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on nocaps-val-overall
    SPICE · 2021-01-02
    13.5
    best: 15.8 (BLIP-2 ViT-G FlanT5 XL (zero-shot))
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on nocaps-val-in-domain
    CIDEr · 2021-01-02
    103.1
    best: 123.7 (BLIP-2 ViT-G FlanT5 XL (zero-shot))
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on nocaps-val-in-domain
    SPICE · 2021-01-02
    14.2
    best: 16.3 (BLIP-2 ViT-G FlanT5 XL (zero-shot))
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Cross-Modal Retrieval on CommercialAdsDataset
    ADD(S) AUC · 2021-01-02
    88.56
    best: 91.73 (AlignCMSS)
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Image Captioning on COCO Captions
    BLEU-4 · 2021-01-02
    41
    best: 46.5 (mPLUG)
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529

Reasoning · 3 results

  • Visual Reasoning on Winoground
    Group Score · 2022-04-07
    14.5
    best: 58.75 (GPT-4V (CoT, pick b/w two options))
    Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality · arXiv:2204.03162
  • Visual Reasoning on Winoground
    Image Score · 2022-04-07
    17.75
    best: 68.75 (GPT-4V (CoT, pick b/w two options))
    Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality · arXiv:2204.03162
  • Visual Reasoning on Winoground
    Text Score · 2022-04-07
    37.75
    best: 75.5 (GPT-4o + CA)
    Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality · arXiv:2204.03162

Miscellaneous · 2 results

  • Image Retrieval with Multi-Modal Query on CommercialAdsDataset
    ADD(S) AUC · 2021-01-02
    88.56
    best: 91.73 (AlignCMSS)
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529
  • Cross-Modal Information Retrieval on CommercialAdsDataset
    ADD(S) AUC · 2021-01-02
    88.56
    best: 91.73 (AlignCMSS)
    SOTA
    VinVL: Revisiting Visual Representations in Vision-Language Models · arXiv:2101.00529