Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GIT

Reported on 44 benchmarks across 4 tasks · 3 papers · 17 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
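The effect of exact-name matching can be illustrated with a small sketch. The rows below are transcribed from this page (model name as listed, arXiv ID, task); the tuple layout and the `group_by` helper are hypothetical and are not part of the site's actual pipeline. Grouping by name alone merges all three papers into one entry, while grouping by arXiv ID keeps them apart.

```python
from collections import defaultdict

# A few rows transcribed from this page: (model name as listed, arXiv ID, task).
# The tuple layout is an assumption made for illustration only.
rows = [
    ("GIT", "2205.14100", "Image Captioning"),
    ("GIT", "2205.14100", "Visual Question Answering (VQA)"),
    ("GIT", "2403.09394", "Video Captioning"),
    ("GIT", "2307.03948", "Video Question Answering"),
]


def group_by(rows, key_index):
    """Collect the set of tasks seen for each distinct value of one column."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[2])
    return {key: sorted(tasks) for key, tasks in groups.items()}


by_name = group_by(rows, 0)   # one bucket: every row shares the name "GIT"
by_paper = group_by(rows, 1)  # three buckets: one per arXiv ID

print(len(by_name), len(by_paper))
```

When comparing numbers across the sections below, checking the arXiv ID on each result is a simple way to confirm which model variant a score belongs to.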

Natural Language Processing · 37 results

Visual Question Answering (VQA) on MSVD-QA
Accuracy · uses extra data · 2022-05-27
0.568
best: 0.61 (VLAB)
SOTA
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD in-domain
B2 · 2022-05-27
76.1
SOTA
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD in-domain
B3 · 2022-05-27
60.53
SOTA
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD in-domain
B4 · 2022-05-27
41.65
SOTA
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD in-domain
ROUGE-L · 2022-05-27
64.02
SOTA
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD out-of-domain
B2 · 2022-05-27
71.28
SOTA
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD out-of-domain
B3 · 2022-05-27
52.66
SOTA
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD out-of-domain
METEOR · 2022-05-27
30.45
SOTA
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD out-of-domain
ROUGE-L · 2022-05-27
60.96
SOTA
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD out-of-domain
SPICE · 2022-05-27
15.7
SOTA
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on COCO Captions
BLEU-4 · 2022-05-27
44.1
best: 46.5 (mPLUG)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on COCO Captions
CIDEr · 2022-05-27
151.1
best: 155.1 (mPLUG)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on COCO Captions
METEOR · 2022-05-27
32.2
best: 33.9 (CoCa)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on COCO Captions
SPICE · 2022-05-27
26.3
best: 27 (VAST)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD in-domain
B1 · 2022-05-27
88.55
best: 88.86 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD in-domain
CIDEr · 2022-05-27
122.4
best: 124.18 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD in-domain
METEOR · 2022-05-27
33.41
best: 33.83 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD in-domain
SPICE · 2022-05-27
16.18
best: 16.36 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD near-domain
B1 · 2022-05-27
88.56
best: 88.9 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD near-domain
B2 · 2022-05-27
75.48
best: 75.86 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD near-domain
B3 · 2022-05-27
58.46
best: 58.9 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD near-domain
B4 · 2022-05-27
38.44
best: 38.95 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD near-domain
CIDEr · 2022-05-27
123.92
best: 125.51 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD near-domain
METEOR · 2022-05-27
32.86
best: 32.95 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD near-domain
ROUGE-L · 2022-05-27
63.5
best: 63.66 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD near-domain
SPICE · 2022-05-27
15.96
best: 16.11 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD entire
B1 · 2022-05-27
88.1
best: 88.43 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD entire
B2 · 2022-05-27
74.81
best: 75.02 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD entire
B3 · 2022-05-27
57.68
best: 57.87 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD entire
B4 · 2022-05-27
37.35
best: 37.65 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD entire
CIDEr · 2022-05-27
123.39
best: 124.77 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD entire
METEOR · 2022-05-27
32.5
best: 32.56 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD entire
ROUGE-L · 2022-05-27
63.12
best: 63.19 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD entire
SPICE · 2022-05-27
15.94
best: 16.06 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD out-of-domain
B1 · 2022-05-27
85.99
best: 86.28 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD out-of-domain
B4 · 2022-05-27
30.04
best: 30.15 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100
Image Captioning on nocaps-XD out-of-domain
CIDEr · 2022-05-27
122.04
best: 122.27 (GIT2)
GIT: A Generative Image-to-text Transformer for Vision and Language arXiv:2205.14100

Computer Vision · 6 results

Video Captioning on MSVD-CTN
CIDEr · uses extra data · 2024-03-14
45.63
best: 63.51 (CEN)
SOTA
GiT: Towards Generalist Vision Transformer through Universal Language Interface arXiv:2403.09394
Video Captioning on MSVD-CTN
ROUGE-L · uses extra data · 2024-03-14
27.51
best: 31.46 (CEN)
SOTA
GiT: Towards Generalist Vision Transformer through Universal Language Interface arXiv:2403.09394
Video Captioning on MSVD-CTN
SPICE · uses extra data · 2024-03-14
15.58
best: 19.25 (CEN)
SOTA
GiT: Towards Generalist Vision Transformer through Universal Language Interface arXiv:2403.09394
Video Captioning on MSRVTT-CTN
CIDEr · uses extra data · 2024-03-14
32.43
best: 49.87 (CEN)
SOTA
GiT: Towards Generalist Vision Transformer through Universal Language Interface arXiv:2403.09394
Video Captioning on MSRVTT-CTN
ROUGE-L · uses extra data · 2024-03-14
24.51
best: 27.9 (CEN)
SOTA
GiT: Towards Generalist Vision Transformer through Universal Language Interface arXiv:2403.09394
Video Captioning on MSRVTT-CTN
SPICE · uses extra data · 2024-03-14
13.7
best: 15.76 (CEN)
SOTA
GiT: Towards Generalist Vision Transformer through Universal Language Interface arXiv:2403.09394

Reasoning · 1 result

Video Question Answering on RoadTextVQA
ACCURACY · uses extra data · 2023-07-08
29.58
SOTA
Reading Between the Lanes: Text VideoQA on the Road arXiv:2307.03948