Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Gemini-1.5 Pro

Reported on 27 benchmarks across 4 tasks · 2 papers · 11 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
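To illustrate what exact-name matching implies, here is a minimal Python sketch. The record layout, the function name `group_by_exact_name`, and the second spelling "Gemini 1.5 Pro" are hypothetical and are not taken from the site's implementation; the point is only that two spellings of a model name land in separate groups.

```python
from collections import defaultdict

# Hypothetical records: only the "Gemini-1.5 Pro" spelling appears on this page.
# The "Gemini 1.5 Pro" spelling is an invented variant used for illustration.
records = [
    {"model": "Gemini-1.5 Pro", "metric": "ACC"},
    {"model": "Gemini 1.5 Pro", "metric": "ACC"},
    {"model": "Gemini-1.5 Pro", "metric": "METEOR"},
]

def group_by_exact_name(rows):
    """Group result rows by the model name string, with no normalization."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["model"]].append(row)
    return dict(groups)

groups = group_by_exact_name(records)
for name, rows in groups.items():
    print(name, [r["metric"] for r in rows])
```

Under this kind of matching, a differently spelled name forms its own group, and two papers that reuse one name for different variants are merged into the same group.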

Natural Language Processing · 16 results

Visual Question Answering (VQA) on SME
CIDEr · 2024-03-08
276.14
best: 510.44 (MEAgent)
SOTA
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering (VQA) on SME
ROUGE-L · 2024-03-08
55.9
best: 79.41 (MEAgent)
SOTA
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering (VQA) on SME
SPICE · 2024-03-08
40.58
best: 64.09 (MEAgent)
SOTA
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering on SME
CIDEr · 2024-03-08
276.14
best: 510.44 (MEAgent)
SOTA
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering on SME
ROUGE-L · 2024-03-08
55.9
best: 79.41 (MEAgent)
SOTA
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering on SME
SPICE · 2024-03-08
40.58
best: 64.09 (MEAgent)
SOTA
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering (VQA) on SME
#Learning Samples (N) · 2024-03-08
16
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering (VQA) on SME
ACC · 2024-03-08
40.88
best: 51.45 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering (VQA) on SME
BLEU-4 · 2024-03-08
41.87
best: 67.91 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering (VQA) on SME
Detection · 2024-03-08
1.4
best: 29.09 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering (VQA) on SME
METEOR · 2024-03-08
34.61
best: 50.55 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering on SME
#Learning Samples (N) · 2024-03-08
16
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering on SME
ACC · 2024-03-08
40.88
best: 51.45 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering on SME
BLEU-4 · 2024-03-08
41.87
best: 67.91 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering on SME
Detection · 2024-03-08
1.4
best: 29.09 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Visual Question Answering on SME
METEOR · 2024-03-08
34.61
best: 50.55 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Computer Vision · 8 results

Explanatory Visual Question Answering on SME
CIDEr · 2024-03-08
276.14
best: 510.44 (MEAgent)
SOTA
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Explanatory Visual Question Answering on SME
ROUGE-L · 2024-03-08
55.9
best: 79.41 (MEAgent)
SOTA
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Explanatory Visual Question Answering on SME
SPICE · 2024-03-08
40.58
best: 64.09 (MEAgent)
SOTA
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Explanatory Visual Question Answering on SME
#Learning Samples (N) · 2024-03-08
16
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Explanatory Visual Question Answering on SME
ACC · 2024-03-08
40.88
best: 51.45 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Explanatory Visual Question Answering on SME
BLEU-4 · 2024-03-08
41.87
best: 67.91 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Explanatory Visual Question Answering on SME
Detection · 2024-03-08
1.4
best: 29.09 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Explanatory Visual Question Answering on SME
METEOR · 2024-03-08
34.61
best: 50.55 (MEAgent)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context arXiv:2403.05530

Methodology · 3 results

Optical Character Recognition (OCR) on VideoDB's OCR Benchmark Public Collection
Character Error Rate (CER) · 2025-02-10
0.2387
best: 0.2378 (GPT-4o)
SOTA
Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments arXiv:2502.06445

Optical Character Recognition (OCR) on VideoDB's OCR Benchmark Public Collection
Word Error Rate (WER) · 2025-02-10
0.2385
SOTA
Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments arXiv:2502.06445

Optical Character Recognition (OCR) on VideoDB's OCR Benchmark Public Collection
Average Accuracy · 2025-02-10
76.13
best: 76.22 (GPT-4o)
Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments arXiv:2502.06445
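
For readers unfamiliar with the OCR metrics listed here, the following Python sketch shows one common way to compute Character Error Rate and Word Error Rate: Levenshtein edit distance divided by the length of the reference. The function names, the whitespace tokenization, and the lack of any text normalization are assumptions for illustration; the benchmark paper may preprocess or normalize differently, so this should not be read as its exact procedure.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to match the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character-level edits divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word-level edits divided by the number of reference words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(cer("hello world", "hallo world"))
print(wer("hello world", "hallo world"))
```

Both metrics are error rates, so lower values are better, whereas Average Accuracy is higher-is-better.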