Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GPT-4V

Reported on 34 benchmarks across 8 tasks · 6 papers · 22 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
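The effect of exact-name matching can be illustrated with a minimal sketch. The row layout and the `group_by_exact_name` helper are hypothetical, not the site's actual code; the sample values are taken from the listing on this page.

```python
from collections import defaultdict

# Hypothetical result rows: (model name as reported, benchmark, metric, value).
rows = [
    ("GPT-4V", "Winoground", "Text Score", 54.5),
    ("GPT-4V (CoT, pick b/w two options)", "Winoground", "Group Score", 58.75),
    ("GPT-4V", "REBUS", "Accuracy", 24.0),
]


def group_by_exact_name(rows):
    """Group rows by the model name string, compared character for character."""
    grouped = defaultdict(list)
    for name, benchmark, metric, value in rows:
        # No normalisation: a prompting variant with a longer name gets its own
        # entry, and two different variants sharing a name would be merged.
        grouped[name].append((benchmark, metric, value))
    return dict(grouped)


pages = group_by_exact_name(rows)
for name, results in pages.items():
    print(name, len(results))
```

Under this assumption, the "GPT-4V" entry collects only rows reported under that exact string, so a row on this page may come from a different model variant than its neighbours.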

Natural Language Processing · 21 results

Visual Question Answering (VQA) on AutoHallusion
Overall Accuracy · 2024-06-16
66
SOTA
AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models arXiv:2406.10900
Visual Question Answering (VQA) on HallusionBench
Question Pair Acc · 2023-10-23
12.2047
SOTA
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models arXiv:2310.14566
Visual Question Answering (VQA) on CORE-MM
Abductive · 2023-03-15
77.88
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering (VQA) on CORE-MM
Analogical · 2023-03-15
69.86
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering (VQA) on CORE-MM
Deductive · 2023-03-15
74.86
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering (VQA) on CORE-MM
Overall score · 2023-03-15
74.44
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering (VQA) on InfiMM-Eval
Abductive · 2023-03-15
77.88
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering (VQA) on InfiMM-Eval
Analogical · 2023-03-15
69.86
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering (VQA) on InfiMM-Eval
Deductive · 2023-03-15
74.86
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering (VQA) on InfiMM-Eval
Overall score · 2023-03-15
74.44
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering (VQA) on BenchLMM
GPT-3.5 score · uses extra data · 2023-03-15
58.37
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering (VQA) on EmbSpatial-Bench
Generation · 2023-03-15
36.07
best: 70.88 (SoFar)
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering on BenchLMM
GPT-3.5 score · uses extra data · 2023-03-15
58.37
SOTA
GPT-4 Technical Report arXiv:2303.08774
Visual Question Answering on EmbSpatial-Bench
Generation · 2023-03-15
36.07
best: 70.88 (SoFar)
SOTA
GPT-4 Technical Report arXiv:2303.08774
Long-Context Understanding on MMNeedle
1 Image, 2*2 Stitching, Exact Accuracy · 2023-03-15
86.09
best: 94.6 (GPT-4o)
GPT-4 Technical Report arXiv:2303.08774
Long-Context Understanding on MMNeedle
1 Image, 4*4 Stitching, Exact Accuracy · 2023-03-15
54.72
best: 83 (GPT-4o)
GPT-4 Technical Report arXiv:2303.08774
Long-Context Understanding on MMNeedle
1 Image, 8*8 Stitching, Exact Accuracy · 2023-03-15
7.3
best: 29.81 (Gemini Pro 1.5)
GPT-4 Technical Report arXiv:2303.08774
Long-Context Understanding on MMNeedle
10 Images, 1*1 Stitching, Exact Accuracy · 2023-03-15
72.36
best: 97 (GPT-4o)
GPT-4 Technical Report arXiv:2303.08774
Long-Context Understanding on MMNeedle
10 Images, 2*2 Stitching, Exact Accuracy · 2023-03-15
34.24
best: 81.8 (GPT-4o)
GPT-4 Technical Report arXiv:2303.08774
Long-Context Understanding on MMNeedle
10 Images, 4*4 Stitching, Exact Accuracy · 2023-03-15
7.58
best: 26.9 (GPT-4o)
GPT-4 Technical Report arXiv:2303.08774
Long-Context Understanding on MMNeedle
10 Images, 8*8 Stitching, Exact Accuracy
0
best: 1 (GPT-4o)

Robots · 5 results

Object Rearrangement on Open6DOR V2
pos-level0 · 2023-03-15
39.1
best: 96 (SoFar)
SOTA
GPT-4 Technical Report arXiv:2303.08774
Object Rearrangement on Open6DOR V2
pos-level1 · 2023-03-15
46.8
best: 81.5 (SoFar)
SOTA
GPT-4 Technical Report arXiv:2303.08774
Object Rearrangement on Open6DOR V2
rot-level0 · 2023-03-15
9.1
best: 68.6 (SoFar)
SOTA
GPT-4 Technical Report arXiv:2303.08774
Object Rearrangement on Open6DOR V2
rot-level1 · 2023-03-15
6.9
best: 42.2 (SoFar)
SOTA
GPT-4 Technical Report arXiv:2303.08774
Object Rearrangement on Open6DOR V2
rot-level2 · 2023-03-15
11.7
best: 70.1 (SoFar)
SOTA
GPT-4 Technical Report arXiv:2303.08774

Reasoning · 4 results

Multimodal Reasoning on REBUS
Accuracy · 2024-01-11
24
SOTA
REBUS: A Robust Evaluation Benchmark of Understanding Symbols arXiv:2401.05604
Visual Reasoning on Winoground
Group Score · 2024-01-05
37.75
best: 58.75 (GPT-4V (CoT, pick b/w two options))
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs arXiv:2401.02582
Visual Reasoning on Winoground
Image Score · 2024-01-05
42.5
best: 68.75 (GPT-4V (CoT, pick b/w two options))
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs arXiv:2401.02582
Visual Reasoning on Winoground
Text Score · 2024-01-05
54.5
best: 75.5 (GPT-4o + CA)
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs arXiv:2401.02582

Other · 3 results

Factual Inconsistency Detection in Chart Captioning on CHOCOLATE-LLM
Kendall's Tau-c · 2023-03-15
0.205
SOTA
GPT-4 Technical Report arXiv:2303.08774
Factual Inconsistency Detection in Chart Captioning on CHOCOLATE-LVLM
Kendall's Tau-c
0.157
best: 0.178 (ChartVE)
Factual Inconsistency Detection in Chart Captioning on CHOCOLATE-FT
Kendall's Tau-c
0.215
best: 0.291 (Bard (before Gemini))

Computer Vision · 1 result

MMR total on MRR-Benchmark
Total Column Score · uses extra data · 2023-09-29
415
best: 463 (Claude 3.5 Sonnet)
SOTA
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) arXiv:2309.17421