NPHardEval4V

Introduced 2024-03-04

NPHardEval4V is a dynamic reasoning benchmark designed to evaluate the reasoning capabilities of Multimodal Large Language Models (MLLMs).

  1. Purpose and Gap Addressed:

    • The benchmark aims to address existing gaps in evaluating the pure reasoning abilities of MLLMs.
    • It provides a venue to disentangle the effects of various factors (such as image recognition and instruction following) from the overall performance of the models.
    • By focusing solely on reasoning abilities, NPHardEval4V helps researchers understand and guide further development in this area.
  2. Construction and Features:

    • NPHardEval4V is built by converting textual descriptions of questions from the existing NPHardEval dataset into image representations.
    • Unlike traditional benchmarks that primarily focus on static evaluations, NPHardEval4V is dynamic. It is updated monthly to prevent overfitting and ensure authentic and fine-grained model evaluation.
    • The benchmark evaluates MLLMs across three problem classes: polynomial time, NP-complete, and NP-hard problems.
    • It assesses performance in three dimensions:
      • Recognition (RA): How accurately the model perceives the visual input.
      • Instruction-following (ER): How well the model adheres to the task instructions.
      • Reasoning (AA): The model's pure reasoning ability, with recognition and instruction-following effects disentangled.
  3. Findings and Impact:

    • Significant discrepancies in reasoning abilities exist across different models.
    • MLLMs exhibit relatively weak performance compared to Large Language Models (LLMs) in terms of reasoning.
    • Investigating different prompting styles (visual, text, and combined) reveals varying impacts of multimodal inputs on model performance.
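The three evaluation dimensions above can be illustrated with a small scoring sketch. This is a hypothetical illustration, not the benchmark's actual API: the function name, the per-instance record format, and the field names are all assumptions made for clarity.

```python
# Hypothetical sketch of aggregating per-instance results into the three
# dimensions (RA, ER, AA). The record format is an assumption, not the
# benchmark's real data schema.

def score_model(records):
    """Aggregate per-instance booleans into the three dimensions.

    Each record is a dict with boolean fields:
      'recognized' - model restated the problem from the image correctly (RA)
      'followed'   - output matched the required answer format (ER)
      'correct'    - final answer was right (AA)
    """
    n = len(records)
    ra = sum(r["recognized"] for r in records) / n  # recognition rate
    er = sum(r["followed"] for r in records) / n    # instruction-following rate
    aa = sum(r["correct"] for r in records) / n     # reasoning accuracy
    return {"RA": ra, "ER": er, "AA": aa}

# Example: four instances with progressively failing stages.
results = [
    {"recognized": True, "followed": True, "correct": True},
    {"recognized": True, "followed": True, "correct": False},
    {"recognized": True, "followed": False, "correct": False},
    {"recognized": False, "followed": False, "correct": False},
]
print(score_model(results))  # {'RA': 0.75, 'ER': 0.5, 'AA': 0.25}
```

Separating the three rates in this way makes it possible to tell whether a low final accuracy stems from weak perception, poor instruction adherence, or weak reasoning itself.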

In summary, NPHardEval4V provides a valuable resource for assessing the reasoning abilities of MLLMs and contributes to advancing research in this domain.