Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Qwen2-VL-7B

Reported on 22 benchmarks across 7 tasks · 1 paper · 5 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
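The matching rule in the note above can be illustrated with a minimal sketch. The record layout and the `results_for` helper are hypothetical and are not the site's actual code; the values are taken from the listings on this page. It only shows what "exact model name" matching implies: a differently cased or versioned name does not match.

```python
# Hypothetical records: (model_name, benchmark, metric, value).
# The tuple layout is an assumption made for illustration only.
RESULTS = [
    ("Qwen2-VL-7B", "TVBench", "Average Accuracy", 43.8),
    ("Qwen2-VL-7B", "ScreenSpot", "Accuracy (%)", 42.1),
    ("Qwen2.5-VL-7B", "VLM2-Bench", "PC-grp", 69.0),
]


def results_for(model_name, records):
    """Return the records whose model name equals model_name exactly.

    The comparison is case-sensitive and does no normalization, so
    "Qwen2-VL-7B" and "Qwen2.5-VL-7B" are treated as unrelated models.
    """
    return [r for r in records if r[0] == model_name]


matched = results_for("Qwen2-VL-7B", RESULTS)
for name, benchmark, metric, value in matched:
    print(f"{name} | {benchmark} | {metric} | {value}")
```

Under this kind of matching, two papers that report different variants under the same string would be merged into one entry, which is the caveat the note describes.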

Natural Language Processing · 19 results

  • Visual Question Answering (VQA) on VLM2-Bench
    Average Score on VLM2-bench (9 subtasks) · 2024-09-18
    42.37
    best: 60.36 (GPT-4o)
    SOTA
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering (VQA) on VLM2-Bench
    GC-mat · 2024-09-18
    27.8
    best: 37.45 (GPT-4o)
    SOTA
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering (VQA) on VLM2-Bench
    OC-cpr · 2024-09-18
    68.06
    best: 74.17 (GPT-4o)
    SOTA
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering (VQA) on VLM2-Bench
    OC-grp · 2024-09-18
    35
    best: 57.5 (GPT-4o)
    SOTA
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering (VQA) on VLM2-Bench
    PC-grp · 2024-09-18
    49
    best: 69 (Qwen2.5-VL-7B)
    SOTA
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Relation Extraction on Vinoground
    Group Score · 2024-09-18
    15.2
    best: 35 (GPT-4o (CoT))
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Relation Extraction on Vinoground
    Text Score · 2024-09-18
    40.2
    best: 59.2 (GPT-4o (CoT))
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Relation Extraction on Vinoground
    Video Score · 2024-09-18
    32.4
    best: 51 (GPT-4o (CoT))
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Question Answering on VNBench
    Accuracy · 2024-09-18
    33.9
    best: 77.88 (BIMBA-LLaVA-Qwen2-7B)
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering (VQA) on VLM2-Bench
    GC-trk · 2024-09-18
    19.18
    best: 43.38 (Qwen2.5-VL-7B)
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering (VQA) on VLM2-Bench
    OC-cnt · 2024-09-18
    45.99
    best: 80.62 (GPT-4o)
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering (VQA) on VLM2-Bench
    PC-VID · 2024-09-18
    16.25
    best: 66.75 (GPT-4o)
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering (VQA) on VLM2-Bench
    PC-cnt · 2024-09-18
    58.59
    best: 90.5 (GPT-4o)
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering (VQA) on VLM2-Bench
    PC-cpr · 2024-09-18
    61.5
    best: 80 (Qwen2.5-VL-7B)
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering (VQA) on MM-Vet
    GPT-4 score · 2024-09-18
    62
    best: 74.24 (MMCTAgent (GPT-4 + GPT-4V))
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Temporal Relation Extraction on Vinoground
    Group Score · 2024-09-18
    15.2
    best: 35 (GPT-4o (CoT))
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Temporal Relation Extraction on Vinoground
    Text Score · 2024-09-18
    40.2
    best: 59.2 (GPT-4o (CoT))
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Temporal Relation Extraction on Vinoground
    Video Score · 2024-09-18
    32.4
    best: 51 (GPT-4o (CoT))
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Visual Question Answering on MM-Vet
    GPT-4 score · 2024-09-18
    62
    best: 74.24 (MMCTAgent (GPT-4 + GPT-4V))
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)

Reasoning · 3 results

  • Video Question Answering on TVBench
    Average Accuracy · 2024-09-18
    43.8
    best: 63.6 (Seed1.5-VL thinking)
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Video Question Answering on VNBench
    Accuracy · 2024-09-18
    33.9
    best: 77.88 (BIMBA-LLaVA-Qwen2-7B)
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
  • Natural Language Visual Grounding on ScreenSpot
    Accuracy (%) · 2024-09-18
    42.1
    best: 86.34 (UGround-V1-7B)
    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)