Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GPT-4o

Reported on 45 benchmarks across 11 tasks · 5 papers · 23 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
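
As a rough illustration of what exact-name matching implies, the sketch below groups result records by the literal model string. The record layout and the `group_by_exact_name` helper are assumptions made for illustration, not the site's actual pipeline; the two Vinoground Group Score values are taken from this page, and the lowercase `gpt-4o` record is hypothetical.

```python
from collections import defaultdict

# Assumed record layout: (model name as written in the paper, benchmark, metric, value).
results = [
    ("GPT-4o", "Vinoground", "Group Score", 24.6),
    ("GPT-4o (CoT)", "Vinoground", "Group Score", 35.0),
    ("gpt-4o", "Vinoground", "Group Score", 24.6),  # hypothetical spelling variant
]


def group_by_exact_name(records):
    """Group records by the literal model string, with no normalization."""
    grouped = defaultdict(list)
    for name, benchmark, metric, value in records:
        grouped[name].append((benchmark, metric, value))
    return dict(grouped)


pages = group_by_exact_name(results)
# Each distinct spelling gets its own entry, so "GPT-4o", "GPT-4o (CoT)"
# and "gpt-4o" are treated as three separate models.
print(sorted(pages))
```

Under this scheme a spelling variant lands on a different model page, and two papers that use the same string for different variants are merged onto one page, which is the caveat the note describes.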

Natural Language Processing · 37 results

  • Visual Question Answering (VQA) on VLM2-Bench
    Average Score on VLM2-bench (9 subtasks) · 2024-10-25
    60.36
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on VLM2-Bench
    GC-mat · 2024-10-25
    37.45
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on VLM2-Bench
    GC-trk · 2024-10-25
    39.27
    best: 43.38 (Qwen2.5-VL-7B)
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on VLM2-Bench
    OC-cnt · 2024-10-25
    80.62
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on VLM2-Bench
    OC-cpr · 2024-10-25
    74.17
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on VLM2-Bench
    OC-grp · 2024-10-25
    57.5
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on VLM2-Bench
    PC-VID · 2024-10-25
    66.75
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on VLM2-Bench
    PC-cnt · 2024-10-25
    90.5
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on 6-DoF SpatialBench
    Orientation-rel · 2024-10-25
    44.2
    best: 54.6 (SoFar)
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on 6-DoF SpatialBench
    Total · 2024-10-25
    36.2
    best: 43.9 (SoFar)
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering on 6-DoF SpatialBench
    Orientation-rel · 2024-10-25
    44.2
    best: 54.6 (SoFar)
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering on 6-DoF SpatialBench
    Total · 2024-10-25
    36.2
    best: 43.9 (SoFar)
    SOTA
    GPT-4o System Card · arXiv:2410.21276
  • Long-Context Understanding on MMNeedle
    1 Image, 2*2 Stitching, Exact Accuracy · 2023-03-15
    94.6
    SOTA
    GPT-4 Technical Report · arXiv:2303.08774
  • Long-Context Understanding on MMNeedle
    1 Image, 4*4 Stitching, Exact Accuracy · 2023-03-15
    83
    SOTA
    GPT-4 Technical Report · arXiv:2303.08774
  • Long-Context Understanding on MMNeedle
    1 Image, 8*8 Stitching, Exact Accuracy · 2023-03-15
    19
    best: 29.81 (Gemini Pro 1.5)
    SOTA
    GPT-4 Technical Report · arXiv:2303.08774
  • Long-Context Understanding on MMNeedle
    10 Images, 1*1 Stitching, Exact Accuracy · 2023-03-15
    97
    SOTA
    GPT-4 Technical Report · arXiv:2303.08774
  • Long-Context Understanding on MMNeedle
    10 Images, 2*2 Stitching, Exact Accuracy · 2023-03-15
    81.8
    SOTA
    GPT-4 Technical Report · arXiv:2303.08774
  • Long-Context Understanding on MMNeedle
    10 Images, 4*4 Stitching, Exact Accuracy · 2023-03-15
    26.9
    SOTA
    GPT-4 Technical Report · arXiv:2303.08774
  • Long-Context Understanding on MMNeedle
    10 Images, 8*8 Stitching, Exact Accuracy · 2023-03-15
    1
    SOTA
    GPT-4 Technical Report · arXiv:2303.08774
  • Description-guided molecule generation on TOMG-Bench
    wAcc · 2024-12-19
    32.29
    best: 35.92 (Claude-3.5)
    TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation · arXiv:2412.14642
  • Visual Question Answering (VQA) on VLM2-Bench
    PC-cpr · 2024-10-25
    50
    best: 80 (Qwen2.5-VL-7B)
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on VLM2-Bench
    PC-grp · 2024-10-25
    47
    best: 69 (Qwen2.5-VL-7B)
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on 6-DoF SpatialBench
    Orientation-abs · 2024-10-25
    25.8
    best: 31.3 (SoFar)
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on 6-DoF SpatialBench
    Position-abs · 2024-10-25
    28.4
    best: 33.8 (SoFar)
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering (VQA) on 6-DoF SpatialBench
    Position-rel · 2024-10-25
    49.4
    best: 59.6 (SoFar)
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering on 6-DoF SpatialBench
    Orientation-abs · 2024-10-25
    25.8
    best: 31.3 (SoFar)
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering on 6-DoF SpatialBench
    Position-abs · 2024-10-25
    28.4
    best: 33.8 (SoFar)
    GPT-4o System Card · arXiv:2410.21276
  • Visual Question Answering on 6-DoF SpatialBench
    Position-rel · 2024-10-25
    49.4
    best: 59.6 (SoFar)
    GPT-4o System Card · arXiv:2410.21276
  • Question Answering on Video-MME (w/o subs)
    Accuracy (%) · 2024-06-14
    70.3
    best: 77.4 (Video-RAG (based on LLaVA-Video))
    GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding · arXiv:2406.09781
  • Question Answering on Zero-shot Video Question Answering on LongVideoBench
    Accuracy (%) · uses extra data · 2024-06-14
    64
    best: 66.7 (Gemini 1.5 Pro)
    GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding · arXiv:2406.09781
  • Question Answering on Video-MME
    Accuracy (%) · 2024-06-14
    77.2
    best: 81.3 (Gemini 1.5 Pro)
    GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding · arXiv:2406.09781
  • Relation Extraction on Vinoground
    Group Score
    24.6
    best: 35 (GPT-4o (CoT))
  • Relation Extraction on Vinoground
    Text Score
    54
    best: 59.2 (GPT-4o (CoT))
  • Relation Extraction on Vinoground
    Video Score
    38.2
    best: 51 (GPT-4o (CoT))
  • Temporal Relation Extraction on Vinoground
    Group Score
    24.6
    best: 35 (GPT-4o (CoT))
  • Temporal Relation Extraction on Vinoground
    Text Score
    54
    best: 59.2 (GPT-4o (CoT))
  • Temporal Relation Extraction on Vinoground
    Video Score
    38.2
    best: 51 (GPT-4o (CoT))

Methodology · 3 results

  • Optical Character Recognition (OCR) on VideoDB's OCR Benchmark Public Collection
    Average Accuracy · 2025-02-10
    76.22
    SOTA
    Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments · arXiv:2502.06445
  • Optical Character Recognition (OCR) on VideoDB's OCR Benchmark Public Collection
    Character Error Rate (CER) · 2025-02-10
    0.2378
    SOTA
    Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments · arXiv:2502.06445
  • Optical Character Recognition (OCR) on VideoDB's OCR Benchmark Public Collection
    Word Error Rate (WER) · 2025-02-10
    0.5117
    best: 0.2385 (Gemini-1.5 Pro)
    SOTA
    Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments · arXiv:2502.06445

Reasoning · 3 results

  • Video Question Answering on Video-MME (w/o subs)
    Accuracy (%) · 2024-06-14
    70.3
    best: 77.4 (Video-RAG (based on LLaVA-Video))
    GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding · arXiv:2406.09781
  • Video Question Answering on Zero-shot Video Question Answering on LongVideoBench
    Accuracy (%) · uses extra data · 2024-06-14
    64
    best: 66.7 (Gemini 1.5 Pro)
    GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding · arXiv:2406.09781
  • Video Question Answering on Video-MME
    Accuracy (%) · 2024-06-14
    77.2
    best: 81.3 (Gemini 1.5 Pro)
    GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding · arXiv:2406.09781

Computer Vision · 1 result

  • MMR total on MRR-Benchmark
    Total Column Score · uses extra data · 2024-06-14
    457
    best: 463 (Claude 3.5 Sonnet)
    SOTA
    GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding · arXiv:2406.09781

Knowledge Base · 1 result

  • Mathematical Reasoning on FrontierMath
    Accuracy
    0.01
    best: 0.252 (o3)