Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VLM²-Bench

Images · Texts · Videos · CC BY-NC 4.0 License · Introduced 2025-02-17

VLM²-Bench: Benchmarking Vision-Language Models on Visual Cue Matching

Description

VLM²-Bench is the first comprehensive benchmark designed to evaluate the ability of vision-language models (VLMs) to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases, focusing on fundamental visual linking capabilities that humans use daily. A key example is identifying the same person across different photos without prior knowledge of their identity.

Through extensive evaluation of eight open-source VLMs and GPT-4o using various prompting techniques, we uncover significant challenges in visual cue linking. Even the best-performing model, GPT-4o, falls 34.80% below human-level performance. Our analysis highlights critical areas for improvement:

  1. Enhancing core visual understanding with reduced reliance on prior knowledge.
  2. Better integration of language reasoning within visual tasks.
  3. Developing training approaches that improve independent visual relationship inference.

Dataset Characteristics

  • Size: 3,000+ test cases
  • Modalities: Text, image, video
  • Question Types: True/False, multiple-choice, numerical, open-ended
  • Generation Process: Semi-automated with human verification
  • Structure: Organized into three primary categories:
    • General Cue (GC): Evaluates visual element tracking and matching.
    • Object-centric Cue (OC): Focuses on object comparison, counting, and grouping.
    • Person-centric Cue (PC): Measures the ability to compare, count, group, and describe individuals across frames.
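
As a rough illustration of how a benchmark structured this way might be represented and scored, here is a minimal sketch. The field names and the exact-match scoring rule are assumptions for illustration, not the dataset's official schema or evaluation protocol:

```python
from dataclasses import dataclass

# Hypothetical schema for a VLM²-Bench-style test case; the real
# dataset's field names and file format may differ.
@dataclass
class TestCase:
    category: str        # "GC", "OC", or "PC"
    subtask: str         # e.g. "OC-cnt"
    question_type: str   # "tf", "mcq", "num", or "open"
    question: str
    answer: str

def score(case: TestCase, prediction: str) -> float:
    """Exact-match scoring for the closed-form question types.

    Open-ended answers would need a model- or human-based judge,
    so this sketch simply returns 0.0 for them.
    """
    if case.question_type in ("tf", "mcq", "num"):
        return float(prediction.strip().lower() == case.answer.strip().lower())
    return 0.0  # open-ended: placeholder

case = TestCase("PC", "PC-cnt", "num",
                "How many distinct people appear across the images?", "3")
print(score(case, "3"))  # → 1.0
```

In practice, closed-form types (True/False, multiple-choice, numerical) admit automatic exact-match checks, while open-ended description subtasks require a separate judging step.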

Potential Use Cases

  • Benchmarking vision-language models (VLMs) for real-world multi-modal reasoning.
  • Evaluating visual linking abilities and spatial awareness in large models.
  • Analyzing weaknesses in object permanence and relational inference.
  • Providing insights for improving next-generation vision-language architectures.

Paper & Code

📄 Paper: VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
📂 Code Repository: GitHub - vlm2-bench/VLM2-Bench

BibTeX Citation

@misc{zhang2025vlm2benchcloserlookvlms,
      title={VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues}, 
      author={Jianshu Zhang and Dongyu Yao and Renjie Pi and Paul Pu Liang and Yi R. Fung},
      year={2025},
      eprint={2502.12084},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12084}
}

Benchmarks

  • Visual Question Answering (VQA) / GC-mat
  • Visual Question Answering (VQA) / GC-trk
  • Visual Question Answering (VQA) / OC-cpr
  • Visual Question Answering (VQA) / OC-cnt
  • Visual Question Answering (VQA) / OC-grp
  • Visual Question Answering (VQA) / PC-cpr
  • Visual Question Answering (VQA) / PC-cnt
  • Visual Question Answering (VQA) / PC-grp
  • Visual Question Answering (VQA) / PC-VID
  • Visual Question Answering (VQA) / Average Score on VLM2-bench (9 subtasks)
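
The final benchmark listed above aggregates the nine subtasks. Assuming it is an unweighted mean of per-subtask scores (the exact weighting is not stated on this page), the computation looks like this, with made-up scores rather than real leaderboard numbers:

```python
# Hypothetical per-subtask scores (percent); real leaderboard values differ.
subtask_scores = {
    "GC-mat": 50.0, "GC-trk": 45.0,
    "OC-cpr": 60.0, "OC-cnt": 55.0, "OC-grp": 40.0,
    "PC-cpr": 65.0, "PC-cnt": 50.0, "PC-grp": 35.0, "PC-VID": 30.0,
}

# Unweighted mean across the nine subtasks.
average = sum(subtask_scores.values()) / len(subtask_scores)
print(f"Average score over {len(subtask_scores)} subtasks: {average:.2f}")
```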

Statistics

  • Papers: 9
  • Benchmarks: 10

Links

Homepage

Tasks

  • Video Question Answering
  • Visual Question Answering (VQA)