Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Gemini-1.5 Pro

Reported on 27 benchmarks across 4 tasks · 2 papers · 11 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
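To illustrate the note above, here is a minimal sketch of what exact-name matching implies. It is not the site's actual implementation; the function `group_by_exact_name`, the row layout, and the "ExampleBench" row are hypothetical and exist only for illustration. The first two rows reuse values listed on this page.

```python
from collections import defaultdict


def group_by_exact_name(rows):
    """Group result rows by model name using exact string equality."""
    groups = defaultdict(list)
    for name, benchmark, metric, value in rows:
        groups[name].append((benchmark, metric, value))
    return dict(groups)


rows = [
    ("Gemini-1.5 Pro", "SME", "CIDEr", 276.14),
    ("Gemini-1.5 Pro", "SME", "ROUGE-L", 55.9),
    # Hypothetical spelling variant; the value is a placeholder, not a real result.
    ("Gemini 1.5 Pro", "ExampleBench", "ACC", 0.0),
]

groups = group_by_exact_name(rows)
print(len(groups["Gemini-1.5 Pro"]))  # rows filed under the exact name
print("Gemini 1.5 Pro" in groups)     # the variant spelling forms its own group
```

Under this assumption, a paper that writes the name with a space instead of a hyphen would land on a separate page, while two papers that use the identical string for different model variants would be merged on the same page.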

Natural Language Processing (16 results)

  • Visual Question Answering (VQA) on SME
    CIDEr · 2024-03-08
    276.14
    best: 510.44 (MEAgent)
    SOTA
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering (VQA) on SME
    ROUGE-L · 2024-03-08
    55.9
    best: 79.41 (MEAgent)
    SOTA
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering (VQA) on SME
    SPICE · 2024-03-08
    40.58
    best: 64.09 (MEAgent)
    SOTA
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering on SME
    CIDEr · 2024-03-08
    276.14
    best: 510.44 (MEAgent)
    SOTA
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering on SME
    ROUGE-L · 2024-03-08
    55.9
    best: 79.41 (MEAgent)
    SOTA
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering on SME
    SPICE · 2024-03-08
    40.58
    best: 64.09 (MEAgent)
    SOTA
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering (VQA) on SME
    #Learning Samples (N) · 2024-03-08
    16
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering (VQA) on SME
    ACC · 2024-03-08
    40.88
    best: 51.45 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering (VQA) on SME
    BLEU-4 · 2024-03-08
    41.87
    best: 67.91 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering (VQA) on SME
    Detection · 2024-03-08
    1.4
    best: 29.09 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering (VQA) on SME
    METEOR · 2024-03-08
    34.61
    best: 50.55 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering on SME
    #Learning Samples (N) · 2024-03-08
    16
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering on SME
    ACC · 2024-03-08
    40.88
    best: 51.45 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering on SME
    BLEU-4 · 2024-03-08
    41.87
    best: 67.91 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering on SME
    Detection · 2024-03-08
    1.4
    best: 29.09 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Visual Question Answering on SME
    METEOR · 2024-03-08
    34.61
    best: 50.55 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)

Computer Vision (8 results)

  • Explanatory Visual Question Answering on SME
    CIDEr · 2024-03-08
    276.14
    best: 510.44 (MEAgent)
    SOTA
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Explanatory Visual Question Answering on SME
    ROUGE-L · 2024-03-08
    55.9
    best: 79.41 (MEAgent)
    SOTA
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Explanatory Visual Question Answering on SME
    SPICE · 2024-03-08
    40.58
    best: 64.09 (MEAgent)
    SOTA
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Explanatory Visual Question Answering on SME
    #Learning Samples (N) · 2024-03-08
    16
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Explanatory Visual Question Answering on SME
    ACC · 2024-03-08
    40.88
    best: 51.45 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Explanatory Visual Question Answering on SME
    BLEU-4 · 2024-03-08
    41.87
    best: 67.91 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Explanatory Visual Question Answering on SME
    Detection · 2024-03-08
    1.4
    best: 29.09 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
  • Explanatory Visual Question Answering on SME
    METEOR · 2024-03-08
    34.61
    best: 50.55 (MEAgent)
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)

Methodology (3 results)

  • Optical Character Recognition (OCR) on VideoDB's OCR Benchmark Public Collection
    Character Error Rate (CER) · 2025-02-10
    0.2387
    best: 0.2378 (GPT-4o)
    SOTA
    Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments (arXiv:2502.06445)
  • Optical Character Recognition (OCR) on VideoDB's OCR Benchmark Public Collection
    Word Error Rate (WER) · 2025-02-10
    0.2385
    SOTA
    Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments (arXiv:2502.06445)
  • Optical Character Recognition (OCR) on VideoDB's OCR Benchmark Public Collection
    Average Accuracy · 2025-02-10
    76.13
    best: 76.22 (GPT-4o)
    Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments (arXiv:2502.06445)
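
For readers unfamiliar with the two error-rate metrics listed above, the sketch below shows the common textbook definitions: edit distance between reference and prediction, divided by the reference length in characters (CER) or words (WER). This is an assumption about the general definition only; the benchmark paper may apply its own normalization or tokenization, and the function names here are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference string."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate on whitespace-split tokens; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("hello", "hallo"))              # one substitution over five characters
print(wer("the cat sat", "the cat sit"))  # one wrong word out of three
```

Lower is better for both metrics, which is why the CER and WER entries above are ranked in the opposite direction from Average Accuracy.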