NPHardEval4V

Introduced 2024-03-04

NPHardEval4V is a dynamic reasoning benchmark designed to evaluate the reasoning capabilities of Multimodal Large Language Models (MLLMs).

  1. Purpose and Gap Addressed:

    • The benchmark aims to address existing gaps in evaluating the pure reasoning abilities of MLLMs.
    • It provides a venue to disentangle the effects of various factors (such as image recognition and instruction following) from the overall performance of the models.
    • By focusing solely on reasoning abilities, NPHardEval4V helps researchers understand and guide further development in this area.
  2. Construction and Features:

    • NPHardEval4V is built by converting textual descriptions of questions from the existing NPHardEval dataset into image representations.
    • Unlike traditional benchmarks that primarily focus on static evaluations, NPHardEval4V is dynamic. It is updated monthly to prevent overfitting and ensure authentic and fine-grained model evaluation.
    • The benchmark evaluates MLLMs across three problem classes: polynomial time, NP-complete, and NP-hard problems.
    • It assesses performance in three dimensions:
      • Recognition (RA): How accurately the model perceives the visual input.
      • Instruction-following (ER): How well the model adheres to the task instructions.
      • Reasoning (AA): The model's pure reasoning ability, with recognition and instruction-following effects disentangled.
  3. Findings and Impact:

    • Significant discrepancies in reasoning abilities exist across different models.
    • MLLMs exhibit relatively weak performance compared to Large Language Models (LLMs) in terms of reasoning.
    • Investigating different prompting styles (visual, text, and combined) reveals varying impacts of multimodal inputs on model performance.
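The three evaluation dimensions above can be illustrated with a small scoring sketch. This is a hypothetical illustration, not the benchmark's actual API: the function name, the per-instance record format, and the field names are all assumptions made for clarity.

```python
# Hypothetical sketch of aggregating per-instance results into the three
# dimensions (RA, ER, AA). The record format is an assumption, not the
# benchmark's real data schema.

def score_model(records):
    """Aggregate per-instance booleans into the three dimensions.

    Each record is a dict with boolean fields:
      'recognized' - model restated the problem from the image correctly (RA)
      'followed'   - output matched the required answer format (ER)
      'correct'    - final answer was right (AA)
    """
    n = len(records)
    ra = sum(r["recognized"] for r in records) / n  # recognition rate
    er = sum(r["followed"] for r in records) / n    # instruction-following rate
    aa = sum(r["correct"] for r in records) / n     # reasoning accuracy
    return {"RA": ra, "ER": er, "AA": aa}

# Example: four instances with progressively failing stages.
results = [
    {"recognized": True, "followed": True, "correct": True},
    {"recognized": True, "followed": True, "correct": False},
    {"recognized": True, "followed": False, "correct": False},
    {"recognized": False, "followed": False, "correct": False},
]
print(score_model(results))  # {'RA': 0.75, 'ER': 0.5, 'AA': 0.25}
```

Separating the three rates in this way makes it possible to tell whether a low final accuracy stems from weak perception, poor instruction adherence, or weak reasoning itself.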

In summary, NPHardEval4V provides a valuable resource for assessing the reasoning abilities of MLLMs and contributes to advancing research in this domain.