Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GIT

Reported on 44 benchmarks across 4 tasks · 3 papers · 17 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
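
As a rough illustration of the note above, the sketch below groups result records by their exact model-name string. The record layout and the `papers_by_model` helper are assumptions made for this example, not the site's actual implementation; the arXiv IDs are the three papers listed on this page under the name "GIT".

```python
from collections import defaultdict

# Hypothetical record layout: (model name as recorded, arXiv ID of the source paper).
# The three IDs are the papers that appear on this page under the name "GIT".
results = [
    ("GIT", "2205.14100"),
    ("GIT", "2403.09394"),
    ("GIT", "2307.03948"),
]

def papers_by_model(records):
    """Group arXiv IDs by exact model-name string (no normalization)."""
    grouped = defaultdict(set)
    for name, arxiv_id in records:
        grouped[name].add(arxiv_id)
    # Sort for a stable, readable listing.
    return {name: sorted(ids) for name, ids in grouped.items()}

index = papers_by_model(results)
print(index["GIT"])
```

Because the key is only the name string, results from unrelated papers end up under one entry, so check the paper linked on each result before comparing scores.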

Natural Language Processing · 37 results

  • Visual Question Answering (VQA) on MSVD-QA
    Accuracy · uses extra data · 2022-05-27
    0.568
    best: 0.61 (VLAB)
    SOTA
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD in-domain
    B2 · 2022-05-27
    76.1
    SOTA
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD in-domain
    B3 · 2022-05-27
    60.53
    SOTA
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD in-domain
    B4 · 2022-05-27
    41.65
    SOTA
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD in-domain
    ROUGE-L · 2022-05-27
    64.02
    SOTA
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD out-of-domain
    B2 · 2022-05-27
    71.28
    SOTA
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD out-of-domain
    B3 · 2022-05-27
    52.66
    SOTA
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD out-of-domain
    METEOR · 2022-05-27
    30.45
    SOTA
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD out-of-domain
    ROUGE-L · 2022-05-27
    60.96
    SOTA
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD out-of-domain
    SPICE · 2022-05-27
    15.7
    SOTA
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on COCO Captions
    BLEU-4 · 2022-05-27
    44.1
    best: 46.5 (mPLUG)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on COCO Captions
    CIDEr · 2022-05-27
    151.1
    best: 155.1 (mPLUG)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on COCO Captions
    METEOR · 2022-05-27
    32.2
    best: 33.9 (CoCa)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on COCO Captions
    SPICE · 2022-05-27
    26.3
    best: 27 (VAST)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD in-domain
    B1 · 2022-05-27
    88.55
    best: 88.86 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD in-domain
    CIDEr · 2022-05-27
    122.4
    best: 124.18 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD in-domain
    METEOR · 2022-05-27
    33.41
    best: 33.83 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD in-domain
    SPICE · 2022-05-27
    16.18
    best: 16.36 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD near-domain
    B1 · 2022-05-27
    88.56
    best: 88.9 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD near-domain
    B2 · 2022-05-27
    75.48
    best: 75.86 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD near-domain
    B3 · 2022-05-27
    58.46
    best: 58.9 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD near-domain
    B4 · 2022-05-27
    38.44
    best: 38.95 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD near-domain
    CIDEr · 2022-05-27
    123.92
    best: 125.51 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD near-domain
    METEOR · 2022-05-27
    32.86
    best: 32.95 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD near-domain
    ROUGE-L · 2022-05-27
    63.5
    best: 63.66 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD near-domain
    SPICE · 2022-05-27
    15.96
    best: 16.11 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD entire
    B1 · 2022-05-27
    88.1
    best: 88.43 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD entire
    B2 · 2022-05-27
    74.81
    best: 75.02 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD entire
    B3 · 2022-05-27
    57.68
    best: 57.87 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD entire
    B4 · 2022-05-27
    37.35
    best: 37.65 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD entire
    CIDEr · 2022-05-27
    123.39
    best: 124.77 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD entire
    METEOR · 2022-05-27
    32.5
    best: 32.56 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD entire
    ROUGE-L · 2022-05-27
    63.12
    best: 63.19 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD entire
    SPICE · 2022-05-27
    15.94
    best: 16.06 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD out-of-domain
    B1 · 2022-05-27
    85.99
    best: 86.28 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD out-of-domain
    B4 · 2022-05-27
    30.04
    best: 30.15 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100
  • Image Captioning on nocaps-XD out-of-domain
    CIDEr · 2022-05-27
    122.04
    best: 122.27 (GIT2)
    GIT: A Generative Image-to-text Transformer for Vision and Language · arXiv:2205.14100

Computer Vision · 6 results

  • Video Captioning on MSVD-CTN
    CIDEr · uses extra data · 2024-03-14
    45.63
    best: 63.51 (CEN)
    SOTA
    GiT: Towards Generalist Vision Transformer through Universal Language Interface · arXiv:2403.09394
  • Video Captioning on MSVD-CTN
    ROUGE-L · uses extra data · 2024-03-14
    27.51
    best: 31.46 (CEN)
    SOTA
    GiT: Towards Generalist Vision Transformer through Universal Language Interface · arXiv:2403.09394
  • Video Captioning on MSVD-CTN
    SPICE · uses extra data · 2024-03-14
    15.58
    best: 19.25 (CEN)
    SOTA
    GiT: Towards Generalist Vision Transformer through Universal Language Interface · arXiv:2403.09394
  • Video Captioning on MSRVTT-CTN
    CIDEr · uses extra data · 2024-03-14
    32.43
    best: 49.87 (CEN)
    SOTA
    GiT: Towards Generalist Vision Transformer through Universal Language Interface · arXiv:2403.09394
  • Video Captioning on MSRVTT-CTN
    ROUGE-L · uses extra data · 2024-03-14
    24.51
    best: 27.9 (CEN)
    SOTA
    GiT: Towards Generalist Vision Transformer through Universal Language Interface · arXiv:2403.09394
  • Video Captioning on MSRVTT-CTN
    SPICE · uses extra data · 2024-03-14
    13.7
    best: 15.76 (CEN)
    SOTA
    GiT: Towards Generalist Vision Transformer through Universal Language Interface · arXiv:2403.09394

Reasoning · 1 result

  • Video Question Answering on RoadTextVQA
    Accuracy · uses extra data · 2023-07-08
    29.58
    SOTA
    Reading Between the Lanes: Text VideoQA on the Road · arXiv:2307.03948