Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Question Answering (VQA) on ViP-Bench

Metric: GPT-4 score (bbox) (higher is better)


Results

| # | Model | GPT-4 score (bbox) | Extra Data | Paper | Date | Code |
|---|-------|-------------------:|------------|-------|------|------|
| 1 | GPT-4V-turbo-detail:high (Visual Prompt) | 60.7 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 2 | GPT-4V-turbo-detail:low (Visual Prompt) | 52.8 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 3 | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt) | 50.5 | Yes | Inst-IT: Boosting Multimodal Instance Understand... | 2024-12-04 | Code |
| 4 | ViP-LLaVA-13B (Visual Prompt) | 48.3 | No | Making Large Language Models Better Data Creators | 2023-10-31 | Code |
| 5 | LLaVA-1.5-13B (Coordinates) | 47.1 | No | Improved Baselines with Visual Instruction Tuning | 2023-10-05 | Code |
| 6 | Qwen-VL-Chat (Coordinates) | 45.3 | No | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |
| 7 | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt) | 45.1 | Yes | Inst-IT: Boosting Multimodal Instance Understand... | 2024-12-04 | Code |
| 8 | LLaVA-1.5-13B (Visual Prompt) | 41.8 | No | Improved Baselines with Visual Instruction Tuning | 2023-10-05 | Code |
| 9 | Qwen-VL-Chat (Visual Prompt) | 39.2 | No | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |
| 10 | InstructBLIP-13B (Visual Prompt) | 35.8 | No | InstructBLIP: Towards General-purpose Vision-Lan... | 2023-05-11 | Code |
| 11 | GPT4ROI 7B (ROI) | 35.1 | No | GPT4RoI: Instruction Tuning Large Language Model... | 2023-07-07 | Code |
| 12 | Shikra-7B (Coordinates) | 33.7 | No | Shikra: Unleashing Multimodal LLM's Referential ... | 2023-06-27 | Code |
| 13 | Kosmos-2 (Discrete Token) | 26.9 | No | Kosmos-2: Grounding Multimodal Large Language Mo... | 2023-06-26 | Code |
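The ranking above can be reproduced programmatically. A minimal sketch, using only the model names and scores copied from the table (no other data assumed): sort descending on the GPT-4 score (bbox) metric, since higher is better, and number the rows.

```python
# Leaderboard rows as (model, score) pairs, copied from the table above.
results = [
    ("GPT-4V-turbo-detail:high (Visual Prompt)", 60.7),
    ("GPT-4V-turbo-detail:low (Visual Prompt)", 52.8),
    ("LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt)", 50.5),
    ("ViP-LLaVA-13B (Visual Prompt)", 48.3),
    ("LLaVA-1.5-13B (Coordinates)", 47.1),
    ("Qwen-VL-Chat (Coordinates)", 45.3),
    ("LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt)", 45.1),
    ("LLaVA-1.5-13B (Visual Prompt)", 41.8),
    ("Qwen-VL-Chat (Visual Prompt)", 39.2),
    ("InstructBLIP-13B (Visual Prompt)", 35.8),
    ("GPT4ROI 7B (ROI)", 35.1),
    ("Shikra-7B (Coordinates)", 33.7),
    ("Kosmos-2 (Discrete Token)", 26.9),
]

# Higher is better, so sort by score descending; enumerate supplies the rank column.
ranked = sorted(results, key=lambda row: row[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    print(f"{rank:2d}  {score:5.1f}  {model}")
```

Because the table is already in descending score order, `ranked` matches the listed order exactly.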