Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VLM²-Bench

Images · Texts · Videos · CC BY-NC 4.0 License · Introduced 2025-02-17

VLM²-Bench: Benchmarking Vision-Language Models on Visual Cue Matching

Description

VLM²-Bench is the first comprehensive benchmark designed to evaluate the ability of vision-language models (VLMs) to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases, focusing on fundamental visual linking capabilities that humans use daily. A key example is identifying the same person across different photos without prior knowledge of their identity.

Through extensive evaluation of eight open-source VLMs and GPT-4o using various prompting techniques, we uncover significant challenges in visual cue linking. Even the best-performing model, GPT-4o, falls 34.80% below human-level performance. Our analysis highlights critical areas for improvement:

  1. Enhancing core visual understanding with reduced reliance on prior knowledge.
  2. Better integration of language reasoning within visual tasks.
  3. Developing training approaches that improve independent visual relationship inference.

Dataset Characteristics

  • Size: 3,000+ test cases
  • Modalities: Text, image, video
  • Question Types: True/False, multiple-choice, numerical, open-ended
  • Generation Process: Semi-automated with human verification
  • Structure: Organized into three primary categories:
    • General Cue (GC): Evaluates visual element tracking and matching.
    • Object-centric Cue (OC): Focuses on object comparison, counting, and grouping.
    • Person-centric Cue (PC): Measures the ability to compare, count, group, and describe individuals across frames.
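
As a rough illustration of how a benchmark structured this way might be represented and scored, here is a minimal sketch. The field names and the exact-match scoring rule are assumptions for illustration, not the dataset's official schema or evaluation protocol:

```python
from dataclasses import dataclass

# Hypothetical schema for a VLM²-Bench-style test case; the real
# dataset's field names and file format may differ.
@dataclass
class TestCase:
    category: str        # "GC", "OC", or "PC"
    subtask: str         # e.g. "OC-cnt"
    question_type: str   # "tf", "mcq", "num", or "open"
    question: str
    answer: str

def score(case: TestCase, prediction: str) -> float:
    """Exact-match scoring for the closed-form question types.

    Open-ended answers would need a model- or human-based judge,
    so this sketch simply returns 0.0 for them.
    """
    if case.question_type in ("tf", "mcq", "num"):
        return float(prediction.strip().lower() == case.answer.strip().lower())
    return 0.0  # open-ended: placeholder

case = TestCase("PC", "PC-cnt", "num",
                "How many distinct people appear across the images?", "3")
print(score(case, "3"))  # → 1.0
```

In practice, closed-form types (True/False, multiple-choice, numerical) admit automatic exact-match checks, while open-ended description subtasks require a separate judging step.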

Potential Use Cases

  • Benchmarking vision-language models (VLMs) for real-world multi-modal reasoning.
  • Evaluating visual linking abilities and spatial awareness in large models.
  • Analyzing weaknesses in object permanence and relational inference.
  • Providing insights for improving next-generation vision-language architectures.

Paper & Code

📄 Paper: VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
📂 Code Repository: GitHub - vlm2-bench/VLM2-Bench

BibTeX Citation

@misc{zhang2025vlm2benchcloserlookvlms,
      title={VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues}, 
      author={Jianshu Zhang and Dongyu Yao and Renjie Pi and Paul Pu Liang and Yi R. Fung},
      year={2025},
      eprint={2502.12084},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12084}
}

Benchmarks

  • Visual Question Answering (VQA) / GC-mat
  • Visual Question Answering (VQA) / GC-trk
  • Visual Question Answering (VQA) / OC-cpr
  • Visual Question Answering (VQA) / OC-cnt
  • Visual Question Answering (VQA) / OC-grp
  • Visual Question Answering (VQA) / PC-cpr
  • Visual Question Answering (VQA) / PC-cnt
  • Visual Question Answering (VQA) / PC-grp
  • Visual Question Answering (VQA) / PC-VID
  • Visual Question Answering (VQA) / Average Score on VLM2-bench (9 subtasks)
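
The final benchmark listed above aggregates the nine subtasks. Assuming it is an unweighted mean of per-subtask scores (the exact weighting is not stated on this page), the computation looks like this, with made-up scores rather than real leaderboard numbers:

```python
# Hypothetical per-subtask scores (percent); real leaderboard values differ.
subtask_scores = {
    "GC-mat": 50.0, "GC-trk": 45.0,
    "OC-cpr": 60.0, "OC-cnt": 55.0, "OC-grp": 40.0,
    "PC-cpr": 65.0, "PC-cnt": 50.0, "PC-grp": 35.0, "PC-VID": 30.0,
}

# Unweighted mean across the nine subtasks.
average = sum(subtask_scores.values()) / len(subtask_scores)
print(f"Average score over {len(subtask_scores)} subtasks: {average:.2f}")
```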

Statistics

  • Papers: 9
  • Benchmarks: 10

Links

Homepage

Tasks

  • Video Question Answering
  • Visual Question Answering (VQA)